How to implement LSI technique for my site

jameswatt Peon

Messages:: 40

Likes Received:: 0

Best Answers:: 0

Trophy Points:: 0

#1

I heard that Google is adopting the LSI technique in its search engine algorithm to display relevant data.

Following are my queries regarding LSI:

How is LSI more effective than popular search engine techniques?
How does SVD work with the statistical data in the matrix?
Can anyone tell me the steps to implement LSI on my site? At present LSI has little weight in Google's algorithm, but what is the possibility of that weight increasing for Google, Yahoo and MSN?

jameswatt, Sep 16, 2006 IP

Domen Lombergar Peon

Messages:: 106

Likes Received:: 2

Best Answers:: 0

Trophy Points:: 0

#2

LSI is increasingly being built into the SERP algorithms, since it's a very good way to establish what a site is about when it contains phrases which could mean a lot of different things.

It's not a replacement for classic SEO techniques though, just something extra.

What I would suggest is to check a dictionary or an encyclopedia and seek some very similar words for the current targeted keywords. Then just implement that into the text.

Domen Lombergar, Sep 16, 2006 IP

jameswatt Peon

Messages:: 40

Likes Received:: 0

Best Answers:: 0

Trophy Points:: 0

#3

Domen Lombergar said: ↑

What I would suggest is to check a dictionary or an encyclopedia and seek some very similar words for the current targeted keywords. Then just implement that into the text.

I have read many documents about LSI but am still confused about how to apply the matrix formula. Is LSI all about using words related to your keyword? For example, if my keyword is "web design", what would the synonyms for that be?

jameswatt, Sep 19, 2006 IP

egarcia Peon

Messages:: 9

Likes Received:: 1

Best Answers:: 0

Trophy Points:: 0

#4

I hope this helps.

Talking about LSI in SEO forums is not the same as explaining, step by step, the how-to calculations behind the analysis, so that IR students and marketers can replicate them and understand how things are done.

There are many "LSI based" snake oil sellers these days claiming to provide LSI services. It is up to each person whether to fall for their tricks. I have visited several forums to chase away this type of marketer, who is giving a black eye to the rest of the SEO industry.

Regarding how to implement LSI, I have written a comprehensive tutorial series on SVD and LSI. The latest article is available at

http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-4-lsi-how-to-calculations.html

Note this is a tutorial series mainly for IR students, so you need a linear algebra background if you wish to replicate all calculations, step by step. In addition I assume readers have assimilated previous parts (1-3) of the series.

There is also a matrix tutorial and several fast track tutorials designed as references for the series. There are also references to resources with code for experimenting with LSI. I might provide bare-bones code for SVD and LSI in the near future.

Hope this helps.

Cheers

Dr. E. Garcia

egarcia, Sep 22, 2006 IP

onestop likes this.

ketan9 Active Member

Messages:: 548

Likes Received:: 9

Best Answers:: 0

Trophy Points:: 58

#5

I know LSI in engineering terms, but I am not quite sure how it applies to search engines. Can someone explain what LSI is for search engines?

ketan9, Sep 22, 2006 IP

egarcia Peon

Messages:: 9

Likes Received:: 1

Best Answers:: 0

Trophy Points:: 0

#6

ketan9 said: ↑

Can someone explain what LSI is for search engines?

Well, this is explained in the above tutorial series, which discusses what LSI is and is not, so one can clear up several SEO misconceptions. Part 1 exposes the many SEO myths going around:

http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-1-understanding.html

You might want to read these articles in sequence, as cutting corners is not recommended. Note that a linear algebra background is required to replicate the calculations.

Part 4 of the tutorial explains that in LSI one scores terms according to a predetermined scoring system and constructs a term-document matrix. This matrix is decomposed into three new matrices, which are then truncated to remove noisy dimensions.

The so-called right eigenvectors of the SVD (more precisely, the right singular vectors) are the document vector coordinates, and the so-called left eigenvectors (the left singular vectors) are the term vector coordinates.

The query vector coordinates are calculated as described in the tutorial.

From there one can compute query-document similarities and rank documents in descending order of similarity values.

The right eigenvectors can be used for doc-doc clustering and the left eigenvectors can be used for term-term clustering. In each case cosine similarity values are computed as is normally done with Term Vector models.

Such clustering opens the door to several IR studies such as automatic thesaurus construction, keyword and document classification, etc.
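The steps described above (build a term-document matrix, decompose it, truncate it, fold in the query, rank by cosine similarity) can be sketched roughly as follows. This is only a minimal illustration, not the tutorial's own code: the toy matrix, the term list and the function name `lsi_similarities` are made up for the example, the entries are raw counts, and the query is folded in with the commonly used formula q_k = qᵀ U_k S_k⁻¹, which is an assumption about the convention rather than something stated in this thread.

```python
import numpy as np

# Hypothetical toy collection: rows = terms, columns = documents (raw counts).
terms = ["web", "design", "layout", "hosting", "server"]
A = np.array([
    [1, 1, 1, 0],   # web
    [1, 2, 0, 0],   # design
    [1, 0, 0, 0],   # layout
    [0, 0, 1, 1],   # hosting
    [0, 0, 1, 2],   # server
], dtype=float)

def lsi_similarities(A, q, k):
    """Cosine similarity between query q and each document in a k-dimensional LSI space."""
    # Decompose A into three matrices: A = U * diag(s) * Vt.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Truncate: keep only the k largest singular dimensions.
    U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T   # rows of V_k = document coordinates
    # Fold the query into the reduced space (assumed convention: q_k = q^T U_k S_k^-1).
    q_k = (q @ U_k) / s_k
    # Cosine similarity between the query and every document vector.
    return (V_k @ q_k) / (np.linalg.norm(V_k, axis=1) * np.linalg.norm(q_k))

# Query "design layout" expressed over the same term list.
q = np.array([0, 1, 1, 0, 0], dtype=float)
sims = lsi_similarities(A, q, k=2)
ranking = np.argsort(-sims)   # document indices in descending order of similarity
print(ranking, np.round(sims, 4))
```

The same cosine computation applied to pairs of rows of `V_k` (or of `U_k`) would give the doc-doc (or term-term) similarities mentioned above for clustering.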

Cheers

Hope this helps

Dr. E. Garcia

egarcia, Sep 22, 2006 IP

axemedia Guest

Messages:: 1,070

Likes Received:: 79

Best Answers:: 0

Trophy Points:: 0

#7

OK, Dr. E. Garcia.

Now let's hear how it applies in the real world, in plain English, for those of us who write content and attempt to optimize our content pages.

Should we just mix up the terms we are using? Add some occurrences of related terms to the main term we may be targeting on a page?

axemedia, Sep 22, 2006 IP

egarcia Peon

Messages:: 9

Likes Received:: 1

Best Answers:: 0

Trophy Points:: 0

#8

axemedia said: ↑

Should we just mix up the terms we are using? Add some occurrences of related terms to the main term we may be targeting on a page?

Nope.

Actually, as Grossman and Frieder mention in their book "Information Retrieval: Algorithms and Heuristics", which we referenced in the series, the key to term similarity is not that terms happen to occur in documents, but that they co-occur with the same neighboring terms. This is called contextuality.

In addition, consider this: in the old LSI literature the term-document matrix (A) to be decomposed was defined using a Term Count model. This model scored term weights as occurrences, so the entries of the A matrix were mere word counts. This served a purpose back then. SEOs are still quoting those LSI papers without realizing this.

These days LSI term weights are scored using local, global and normalization weights, not mere word occurrences. We also explained how even if terms co-occur in documents LSI can fail to assess contextuality.

In fact, LSI models based only on the old implementation (word occurrences only) are easy to deceive. We explained how this is done in the tutorial and how spammers are trying to game LSI scoring systems.

The best approach is to write articles as naturally as possible. An HTML DOM structured approach helps. You can use terms as long as they are on-topic, but do not merely add synonyms that look "forced".

Dr. E. Garcia (aka orion)

egarcia, Sep 22, 2006 IP

seo-mumbai Well-Known Member

Messages:: 2,004

Likes Received:: 183

Best Answers:: 0

Trophy Points:: 105

#9

Yes, but if you just mix terms you need luck. As he explained, if we take the time and use them correctly, then we get good results.

seo-mumbai, Sep 22, 2006 IP

jaguar-archie2006 Banned

Messages:: 631

Likes Received:: 16

Best Answers:: 0

Trophy Points:: 0

#10

More details about LSI: http://en.wikipedia.org/wiki/Latent_semantic_indexing

jaguar-archie2006, Sep 22, 2006 IP

axemedia Guest

Messages:: 1,070

Likes Received:: 79

Best Answers:: 0

Trophy Points:: 0

#11

I was being far too general when I said "mix up some terms", in response to his being far too technical (for me at least).

Is my limited understanding correct? For better rankings for a certain term, the document should have some other related terms that may be frequently associated with the target term. Other pages on the site (linked to that page) should have a few occurrences of other associated terms. Back links from other sites should have the targeted term, and/or associated terms, on that page and/or in the anchor text.

Of course, some serious thought and research must be done to discover some of those associated terms. A search engine would of course be creating this matrix from the millions or billions of pages in its index. From this master matrix it weights term occurrences at a much broader level in an attempt to gain insight into the contextuality of associated terms (words).

axemedia, Sep 22, 2006 IP

sipltech Well-Known Member

Messages:: 713

Likes Received:: 12

Best Answers:: 0

Trophy Points:: 130

#12

What I learnt! Never discount the learning curve: the more you notice, the more you respond to the many subtle opportunities. (One from my quote list.)

I have noticed Dr. E. Garcia (aka Orion), the famous scientist from Israel. Wow! It's sheer luck that you have taken your valuable time to join this community. I have numerous points of confusion and curiosity.

I understand that there is a lot of difference between those who optimize pages and those who design the search algorithm. But an SEO should work out which method is most suitable for helping his clients.

As I understand it, SEs pick up natural keywords which are not forced. Even if we use synonyms or similar words, they should flow naturally. One should avoid deliberate overuse or keyword stuffing. Correct me, please!

Relevancy rules. For example:

I have tested the keywords "calculator" and "~calculator" and the results were exciting! Google returned different results for the two queries: with "calculator" it shows different types of calculator, and with "~calculator" it shows many variations, i.e. conversions, counters, converters and many more. Similarly, when I search for "~converter" it shows sites similar to this. This also means that AdWords users will get more prominent exposure for their ads, and the ads will be more relevant.

When I use two words, i.e. "~currency converter" and "currency converter", it highlights "exchange" rates too, which is similar to "currency converter".

So with a given combination of two keywords or key phrases, what will be the optimum combination and co-occurrence, and how should we calculate that?

I have limited knowledge, but would like to learn more about these results. Can you please explain, Orion?

I am also confused about how search engines see article directories, since each and every one has similar articles. A distributed article results in N duplicate pages.

Thanks in advance. And sorry for my bad grammar.

sipltech, Sep 23, 2006 IP

egarcia Peon

Messages:: 9

Likes Received:: 1

Best Answers:: 0

Trophy Points:: 0

#13

First things first. Hey, Shawn, congratulations on the DP forums. I always wanted to post, but did not find the time. For those who don't know who Shawn is (I doubt there are any), he is the owner of DigitalPoint. I met him back in San Diego when we got together to have lunch and talk SEO, search engines and business.

Second, thanks for the comparison, sipltech, but I think you are mistaking me for Ori Allon from Israel, creator of the Orion search engine. I posted about that engine at SEWF before Google bought it.

I'm Orion, the moderator of the search technology, relevancy and beta test sections of the searchenginewatch forums, though I haven't posted anything new there in months. Those familiar with SEWF know about my outreach efforts within both SEO and IR circles. In a recent interview for Mike Grehan's ClickZ column I explained to him that the whole idea of posting at forums for the mere sake of having a say in a discussion is not for me and that I will post only when necessary. http://www.miislita.com/interviews/unedited-interview-with-dr-garcia.pdf

The recent crew of SEO firms claiming to sell "LSI based" SEO services, along with the spreading of LSI myths by certain SEO firms, has led me to chase those folks away across several forums (sorry, Shawn, for mentioning these forums by name: Cre8asiteforums, SEOMOZ and SEOChat, among a few others). Luckily, at SEWF those fallacies were dispelled long ago. As for those marketers, these are firms that with their "LSI services" are making a caricature of LSI research, not to say a mockery. In the process the industry is getting a black eye from these snake oil sellers. My experience with some of them is that they like to send one or two folks to forums to play pitcher and catcher. Generalizations are risky, and no offense is intended to the starter of this thread, but haven't you noticed that most of their threads at several forums start with almost the same title or naive questions? It looks to me like a copy/paste drive-by shooting effort. They are all over the place.

Regarding optimization, my opinion is this: if it's not broken, don't fix it. If an optimization strategy works great for your clients, stick to it. Why change what is not broken?

Regarding word occurrences in documents: in the first LSI papers, term-document matrices were constructed using the Term Count model, where local weights (L) were assigned to terms in documents using mere word counts or term frequencies (tf), i.e.

Equation 1: w = L = tf

For terms in a query, a similar scoring scheme was adopted.

Thus, the LSI matrix inherited several limitations from this model. Here is one of many: long documents tend to have more instances of words, so the model is biased toward long documents. These score high simply because they are longer, not because they are relevant. I'm not sure it is valid to quote those papers to make correlations with current search results, to equate word occurrences with contextuality, or to relate them to current implementations of LSI at all.
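To make the long-document bias concrete, here is a small sketch of scoring under Equation 1 (w = tf). The two documents, the query and the helper `term_count_score` are invented for illustration; the score is simply the dot product of raw term counts, which is one plain reading of the Term Count model rather than any particular engine's implementation.

```python
from collections import Counter

def term_count_score(query, document):
    """Dot product of raw term counts (Equation 1: w = tf) for a query and a document."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    return sum(q[t] * d[t] for t in q)

short_doc = "currency converter for travellers"
# Hypothetical long page: the same sentence repeated five times plus unrelated filler.
long_doc = " ".join([short_doc] * 5 + ["unrelated filler text"] * 20)

query = "currency converter"
short_score = term_count_score(query, short_doc)
long_score = term_count_score(query, long_doc)
print(short_score, long_score)   # prints: 2 10
```

The long page scores five times higher although most of it is off-topic, purely because the query terms occur more often in it.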

Some papers from the early 2000s used Equation 1 in collections where documents are of similar size and format and are under controlled conditions. Examples of these are titles, abstracts and regulated exams like TOEFL, GRE, SAT, etc., or cases where the collections are relatively small and prestructured. Some of these papers were published to establish a research baseline; others are mere idealizations of conditions and do not consider spam noise from unstructured collections.

Soon after the first LSI models were published, the very same researchers realized that they could do better by incorporating global (G) and normalization (N) weights into term weights (w). Advanced models incorporate the entropy (E) of a term in a collection. Thus, most current LSI models are not based on Equation 1, but on Equation 2:

Equation 2: w = L*G*N

How one defines L, G and N impacts search results. Either way, word occurrence-only models are easy to game and do not grant contextuality. Why, then, is the term count model taught at CS schools? This is done to introduce IR students to basic LSI and Term Vector models. Once they learn the basics, they can move forward and learn about advanced models, make comparisons and draw conclusions. Since my target audience is IR students and search marketers, most of them new to IR and LSI, I unfortunately have to use primitive occurrence models to explain term vector theory and LSI models. Later on in those discussions I cover models based on Equation 2 and on entropy models.
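As a rough illustration of Equation 2 (w = L*G*N), the sketch below picks one common set of definitions. These choices are assumptions made for the example, not the ones used in the tutorial or by any search engine: L = log(1 + tf), G = log(number of documents / document frequency), and N = 1 / length of the document's L*G column. The function name `lgn_weights` and the counts are hypothetical.

```python
import math

def lgn_weights(counts):
    """counts[i][j] = raw count of term i in document j. Returns weights w = L*G*N."""
    n_terms, n_docs = len(counts), len(counts[0])
    # Local weight L: dampened term frequency (assumed choice).
    L = [[math.log(1 + tf) for tf in row] for row in counts]
    # Global weight G: inverse document frequency (assumed choice).
    G = []
    for row in counts:
        df = sum(1 for tf in row if tf > 0)
        G.append(math.log(n_docs / df) if df else 0.0)
    lg = [[L[i][j] * G[i] for j in range(n_docs)] for i in range(n_terms)]
    # Normalization N: scale every document column to unit length (assumed choice).
    weights = [[0.0] * n_docs for _ in range(n_terms)]
    for j in range(n_docs):
        norm = math.sqrt(sum(lg[i][j] ** 2 for i in range(n_terms)))
        for i in range(n_terms):
            weights[i][j] = lg[i][j] / norm if norm else 0.0
    return weights

# Hypothetical counts: rows = terms, columns = documents.
counts = [
    [1, 10, 1],   # "web": occurs in every document, so G = log(3/3) = 0
    [2, 20, 0],   # "design"
    [1, 0, 3],    # "hosting"
]
weights = lgn_weights(counts)
for term_row in weights:
    print([round(w, 3) for w in term_row])
```

With these definitions a term that occurs in every document contributes nothing, and repeating a word many times in a long document no longer inflates its weight the way raw counts do.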

Regarding the use of "~" or other query operators: query operators are not part of the SVD algorithm used in LSI. In fact, one should be able to decompose a small term-document matrix with a basic matrix calculator that does SVD. LSI query operators? This is the first of a list of 25 common myths enumerated in Part 1 of the above tutorial. "~" is used by some systems as a "find related terms" operator, but this feature can be implemented in any system that has a custom-made or built-in thesaurus, without using LSI at all. You just need a reference list of synonyms. This list can be constructed in many different ways, not just with the left eigenvectors of an LSI matrix.
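A "find related terms" operator of this kind needs nothing more than a lookup table. The sketch below is purely illustrative: the `THESAURUS` entries and the `expand_query` helper are hypothetical, and nothing here is meant to describe how any real search engine handles "~".

```python
# Hypothetical hand-built thesaurus; the entries are illustrative only.
THESAURUS = {
    "calculator": ["converter", "counter"],
    "currency converter": ["exchange rates"],
}

def expand_query(query, thesaurus=THESAURUS):
    """Expand a query that starts with '~' using a plain synonym lookup (no SVD, no LSI)."""
    if not query.startswith("~"):
        return [query]
    base = query[1:].strip()
    return [base] + thesaurus.get(base, [])

print(expand_query("~calculator"))   # ['calculator', 'converter', 'counter']
print(expand_query("calculator"))    # ['calculator']
```

How the synonym list is built (by hand, from a dictionary, or from term clustering) is a separate question from how the operator is evaluated.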

Cheers

Dr. E. Garcia

egarcia, Sep 23, 2006 IP

ChrisChoi Peon

Messages:: 51

Likes Received:: 0

Best Answers:: 0

Trophy Points:: 0

#14

Thank you for referring me to this post.

I read before, from people who've talked to Google engineers at SEO conventions, that each word has a numerical value.

However, I don't think the average person can read this and walk away knowing how to use themed and related keywords in their content.

This is really complicated stuff.

ChrisChoi, Sep 24, 2006 IP

sedohr Peon

Messages:: 5

Likes Received:: 1

Best Answers:: 0

Trophy Points:: 0

#15

To get good related LSI terms for my content pages, I use Wordtracker's Keyword Universe search.

Leave the Lateral and Thesaurus checkboxes checked and enter the keyword phrase, for instance "golf equipment".

The results you see when you scroll down are a very good list of LSI words and phrases to use with your main keyword phrase.

Hope this helps,
Randy

sedohr, Sep 25, 2006 IP

Kaudo Well-Known Member

Messages:: 358

Likes Received:: 8

Best Answers:: 0

Trophy Points:: 118

#16

ChrisChoi said: ↑

This is really complicated stuff.

Chris, don't waste your time.
Get back to this topic in 3 years, if your site is among the top 10 searched and needs the LSI differentiation.

There are lots of things you can focus on meanwhile.

Kaudo, Sep 25, 2006 IP

egarcia Peon

Messages:: 9

Likes Received:: 1

Best Answers:: 0

Trophy Points:: 0

#17

Kaudo said: ↑

Chris, don't waste your time....
...
There are lots of things you can focus on meanwhile.

I might not be able to follow this thread any further, but feel free to keep it alive with your fine comments.

In my opinion, your comments are some of the soundest advice in this thread. If it's not broken, don't fix it.

Regarding the use of terms conveying word relatedness: the use of synonyms and related terms to improve a topic is just common sense and a writing/readability practice used by professional writers for centuries. This is a practice one should adopt, but not because back in 1988 some IR researchers applied Golub's 1965 SVD algorithm to a vocabulary problem and called that LSI (or LSA, if you wish). For the record, SVD as a dimensionality reduction technique has been applied for many years in engineering and in the physical and natural sciences (e.g. chemistry, medicine, biology), even before 1988. Do a search in Google to convince yourself. SVD is one of many dimensionality reduction techniques available.

SVD, when used as in LSI, is a great tool for identifying terms responsible for inducing similarity in documents, even when these are not explicitly present. This effect is hidden (latent) and becomes evident after removing noisy dimensions. In the early LSI literature this effect was emphasized, commented on and described, but not fully addressed. A lot of emphasis was given to the ability of LSI to map terms to concepts, without fully addressing the role of high-order co-occurrence. However, where does this effect come from? This is still a hot research topic.

Many SEOs are misquoting such old papers and the focus of that old research. Many of these SEO "experts" don't even know how to do a basic SVD decomposition, nor do they understand the how-to steps involved in computing LSI scores. In the process they have stretched such research findings and added a few myths of their own, in order to better market whatever they sell. For instance, today one can see some suggesting that to make documents "LSI friendly" one needs to stuff content with synonyms or related terms. This perception is incorrect.

A lot of research has been done since then. Current LSI research suggests that what makes LSI work is not that terms occurring in documents happen to be synonyms or related terms. What seems to be at the heart of LSI, and what makes the technique work, is the presence of high-order co-occurrence patterns in the same context in the reduced SVD space. Such connectivity paths can be extracted from a term-term co-occurrence matrix obtained from the SVD algorithm. Terms involved in these paths do not have to be synonyms. These connectivity paths seem to be responsible for inducing similarity in documents. New research efforts and advances in the field are now focused on proposing a theoretical model for such behavior.
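A tiny worked example may help to show what a second-order co-occurrence path looks like. It is a toy sketch under stated assumptions, not a reproduction of the research mentioned above: the three terms, the two documents and the helper `term_term` are invented, and the term-term matrix is taken here as X_k X_kᵀ, where X_k is the rank-k truncation of the term-document matrix. "currency" and "exchange" never appear in the same document, but both co-occur with "converter".

```python
import numpy as np

terms = ["currency", "converter", "exchange"]
# Rows = terms, columns = documents (raw counts).
# doc 0: "currency converter"; doc 1: "converter exchange".
X = np.array([
    [1, 0],   # currency
    [1, 1],   # converter
    [0, 1],   # exchange
], dtype=float)

def term_term(X, k=None):
    """Term-term matrix X_k X_k^T, where X_k keeps only the top k singular dimensions."""
    if k is None:
        return X @ X.T                       # no truncation: plain co-occurrence counts
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    return X_k @ X_k.T

original = term_term(X)        # entry (currency, exchange) is 0: no direct co-occurrence
reduced = term_term(X, k=1)    # after truncation the same entry becomes 0.5
print(original[0, 2], round(reduced[0, 2], 4))
```

The nonzero entry after truncation comes from the path currency, converter, exchange. Neither end term is a synonym of the other; they are linked only through a shared neighbor.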

Cheers

Dr. E. Garcia

egarcia, Sep 26, 2006 IP

Kaudo Well-Known Member

Messages:: 358

Likes Received:: 8

Best Answers:: 0

Trophy Points:: 118

#18

Thank you, Sir.

Kaudo, Sep 26, 2006 IP

egarcia Peon

Messages:: 9

Likes Received:: 1

Best Answers:: 0

Trophy Points:: 0

#19

I referenced this DigitalPoint thread in this Cre8asiteforums thread:

http://www.cre8asiteforums.com/forums/index.php?showtopic=37247&pid=202936&st=20&#entry202936

and explained why there is no such thing as "LSI-Friendly" documents.

This is just another SEO Myth.

Don't be gamed by marketing firms sending out dumb emails. These have been exposed for what they are and for how much they know about search engines.

Dr. E. Garcia

egarcia, Oct 21, 2006 IP

arizzt3049 Peon

Messages:: 3

Likes Received:: 0

Best Answers:: 0

Trophy Points:: 0

#20

If I want to do research for my LSI optimized page, besides Wordtracker, are there any other keyword research tools that can help me? I'd prefer free tools, since I'm just starting out.

arizzt3049, May 12, 2007 IP
