Friday, June 03, 2011

The search for hidden meanings

Throughout written history, people have engaged in finding the hidden meaning in writing.

Fascination with the hieroglyphs on the walls of ancient Egyptian temples and burial sites extends back well before 4 PM on November 26, 1922, when Howard Carter’s search for hidden meanings resulted in the discovery of the 3,300-year-old, untouched tomb of the 19-year-old king Tutankhamun.

Today, we are even more fascinated with exploring our written (and spoken) language.

And it all comes down to what is known as Part-of-speech tagging (POS tagging or POST).

Most of us have done it at school by identifying words as nouns, verbs, adjectives, adverbs, etc.

Back when the Beatles were at their peak and America and its allies were embroiled in the Vietnam war, Dr Christiaan Barnard carried out the world's first human heart transplant, the Six Day War was fought in the Middle East, NASA launched the unmanned Apollo 4 test spacecraft and Britons got their first colour television programmes. But the one development from that same year that affects more people today, and will do so in the future, is the work of Henry Kucera and W. Nelson Francis. They published their classic work Computational Analysis of Present-Day American English (1967), which provided a basic analysis of the words in the body of texts known today simply as the Brown Corpus.

Henry Kucera and W. Nelson Francis did more complicated analysis than getting computers to find nouns and verbs but the principle is the same. It is a process largely based on relationships with adjacent and related words in a phrase, sentence, or paragraph. 

Once performed by hand, POS tagging is now done by computer within the field that grew out of the Brown Corpus, computational linguistics. It uses algorithms which associate discrete terms, as well as hidden parts of speech, with a set of descriptive tags. These tags are either defined in advance or, more recently, created ‘on the fly’ as they are found.
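To make the idea concrete, here is a minimal sketch of tagging by context. It is not the method Kucera and Francis used, nor any real tagger: the word list, the tag names and the two rules are all invented for illustration. It simply shows how the word before an ambiguous word such as "search" can decide whether it is a noun or a verb.

```python
# Hypothetical lexicon: each word maps to the tags it could carry.
POSSIBLE_TAGS = {
    "the": ["DET"],
    "they": ["PRON"],
    "search": ["NOUN", "VERB"],  # ambiguous without context
    "found": ["VERB"],
    "hidden": ["ADJ"],
    "meanings": ["NOUN"],
    "tomb": ["NOUN"],
}


def tag_sentence(sentence):
    """Tag each word, using the previous word's tag to settle ambiguity."""
    tagged = []
    previous_tag = None
    for word in sentence.lower().split():
        options = POSSIBLE_TAGS.get(word, ["UNK"])
        if len(options) == 1:
            choice = options[0]
        elif previous_tag in ("DET", "ADJ") and "NOUN" in options:
            # After "the" or an adjective we expect a noun.
            choice = "NOUN"
        elif previous_tag == "PRON" and "VERB" in options:
            # After a pronoun such as "they" we expect a verb.
            choice = "VERB"
        else:
            choice = options[0]
        tagged.append((word, choice))
        previous_tag = choice
    return tagged


print(tag_sentence("The search found hidden meanings"))
print(tag_sentence("They search the tomb"))
```

In the first sentence "search" follows "the" and is tagged as a noun; in the second it follows "they" and is tagged as a verb. Real taggers learn such patterns from a tagged corpus rather than from hand-written rules.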

The reason that Kucera and Francis’s work is so important is that we have built a whole new form of society on this idea.

Clever scientists have used this idea of extracting hidden meaning to develop a new form of internet.

One of these ideas came from five academics: Scott Deerwester, Susan Dumais, George Furnas, Thomas Landauer and Richard Harshman (1990). They outlined how to analyse relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. Called Latent Semantic Analysis (LSA), the idea assumes that words that are close in meaning will occur in similar pieces of text.
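A rough sketch of the starting point may help. The code below builds only the first ingredient of LSA, a table of how often each term appears in each document, and compares terms by the angle between their rows (cosine similarity). It deliberately leaves out the step that gives LSA its power, the mathematical reduction of that table to a smaller set of concepts, so it should be read as an illustration of the assumption and not as LSA itself. The four documents are invented.

```python
import math

# Hypothetical documents, invented for this example.
DOCUMENTS = [
    "twitter broke the news",
    "news spread on twitter",
    "the tomb was hidden",
    "carter found the tomb",
]


def term_vectors(documents):
    """Return {term: [count in doc 0, count in doc 1, ...]}."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    return {
        term: [doc.split().count(term) for doc in documents]
        for term in vocabulary
    }


def cosine(u, v):
    """Cosine similarity: 1.0 for the same direction, 0.0 for no overlap."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


VECTORS = term_vectors(DOCUMENTS)
print(cosine(VECTORS["twitter"], VECTORS["news"]))  # appear in the same documents
print(cosine(VECTORS["twitter"], VECTORS["tomb"]))  # never appear together
```

"twitter" and "news" turn up in the same documents and score close to 1, while "twitter" and "tomb" share no documents and score 0.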

This idea is used by all manner of analysis programmes and helps find those hidden meanings.

In their paper they say “...Thus while LSA’s potential knowledge is surely imperfect, we believe it can offer a close enough approximation to people’s knowledge to underwrite theories and tests of theories of cognition.” Since 1990, academics have come a long way and accuracy is getting ever closer to social reality.

Today, the use of semantics makes the Google and Bing web search algorithms more accurate, helps newspaper journalists find the most authoritative sources for information and informs the top companies about events and their drivers to optimise financial, marketing and communication decisions.

Remember Kristen Urbahn’s story I blogged about three weeks ago? It has lots of hidden meanings. Using Extractive.com’s special search engine, Kristen can find out about the relationships between different parts of the story (using automated Part of Speech tagging).

The results show the nature of some of the significant words:

PERSON (46)
│├SCREEN ACTOR (4)
││└ Kathy Griffin (4)
││  she
││  her
│├US CABINET MEMBER (1)
││└ Donald Rumsfeld
│├US PRESIDENT (3)
││└ Obama (3)
││  Obama
││  Obama
││  his
│├ Brian Williams
│├ Dan Pfeiffer
│├ Jill Jackson (2)
│  Jill Jackson
│├ Keith (6)
│  Keith Urbahn
│  He
│├ Kristen Urbahn (13)
│  Kristen Urbahn
│  her
│  Kristen Urbahn
│  Kristen
│├ Maggie Fox
│├ Osama Bin Laden (6)
│  Osama Bin Laden
│  Bin Laden
│├ Osama Bin Ladin (5)
│  Osama Bin Ladin
│  He
│  Osama
│  he
│└ Sohaib Athar



LOCATION (14)
│├GPE (13)
││├COUNTRY (5)
│││├ Afghanistan
│││├ Pakistan (2)
│││  Pakistan
│││└ US (2)
││├CITY (4)
│││├ Abbottabad
│││├ Denver
│││├ Guardian
│││└ San Francisco
││└US STATE (4)
││  Kansas
││  South Carolina
││  Washington (2)
││   Washington
││   Washington 
│└ Wiltshire
ORGANIZATION (21)
│├COMMERCIAL ORG (16)
││├MEDIA ORG (7)
│││├BROADCAST NETWORK (5)
││││└TV NETWORK (5)
││││  BBC
││││  CBS
││││  CNN (2)
││││  NBC
│││├ New York Times
│││└ Washington Times
││├ Defence
││├ Google
││├ Social Media Group
││└ Twitter (5)
││  Twitter
│├NON GOVERNMENT ORG (2)
││├ Al Qaeda
││└ Republican Leaders Office
│└UNIVERSITY (3)
│  Preston University
│  University of Kentucky
│  Yale
CONTACT INFO (1)
│└URL (1)
│  HTTP (1)
│   http://goo.gl/qHnFH

OTHER (18)
│├FACILITY (4)
││└BUILDING (4)
││  White House (4)
││   White House Communication Director
││   White House
│├LINKED OTHER (11)
││├ Capitol Hill
││├ Christian
││├ Creative Commons
││├ Dachshunds
││├ Internet
││├ Internet
││├ Mobile
││├ POTUS
││├ President Obama
││├ Royal Wedding
││└ The New York Times
│└SOFTWARE (3)
│  Facebook (3)
│   Facebook

DATE-TIME (16)
│├DATE GENERAL (8)
││├DATE (2)
│││├ Aug. 18, 2009
│││└ May 1 2011
││├DAY OF MONTH (1)
│││└ 1 May
││├MONTH NAME (1)
│││└ May
││├RELATIVE DATE (2)
│││├ months ago
│││└ the evening
││└YEAR (2)
││  2006
││  2011
│└TIME (8)
│  10:30 p.m. Eastern Time
│  10:40 p.m.
│  10:53
│  11 p.m.
│  11:35
│  4pm EST
│  9:45 p.m.
│  from 10:45 p.m.-2:20 a.m.
NUMERIC (20)
│├MEASUREMENT (4)
││└DURATION (4)
││  Five years
││  days
││  former
││  the hours
│├NUMBER (11)
││├ 2.0
││├ 3,000
││├ 5,000
││├ 7.24
││├ millions
││├ more than 185
││├ one
││├ one
││├ six
││├ three
││└ two
│└ORDINAL (5)
│  Third
│  first
│  first
│  second
│  third


Here, then, are the key elements that can be extracted from the blog post.
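How a listing like the one above might be produced can be sketched in a few lines. This is only a guess at the general shape, not a description of how the search engine actually works: it looks up names from a small hand-made list (a "gazetteer") and counts the mentions under each category. The names, categories and sample text are all chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical gazetteer: known names and the category each belongs to.
GAZETTEER = {
    "Kristen Urbahn": "PERSON",
    "Obama": "PERSON",
    "Twitter": "ORGANIZATION",
    "Pakistan": "LOCATION",
    "White House": "FACILITY",
}


def extract_entities(text):
    """Return {category: Counter({name: number of mentions})}."""
    found = {}
    for name, category in GAZETTEER.items():
        # \b stops "Obama" matching inside a longer word.
        mentions = len(re.findall(r"\b" + re.escape(name) + r"\b", text))
        if mentions:
            found.setdefault(category, Counter())[name] = mentions
    return found


# Invented sample text, not a quotation from the original story.
SAMPLE = (
    "Kristen Urbahn saw the news on Twitter before Obama spoke at the "
    "White House. Twitter carried reports from Pakistan."
)

RESULT = extract_entities(SAMPLE)
for category in sorted(RESULT):
    total = sum(RESULT[category].values())
    print(f"{category} ({total})")
    for name, count in sorted(RESULT[category].items()):
        print(f"  {name} ({count})")
```

A real system goes further: it recognises names it has never seen before and links words such as "she" or "his" back to the person they refer to, which a simple lookup cannot do.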

Two people from the 18th and 19th centuries now star in this story.

Thomas Bayes (1702–1761) was the son of a London Presbyterian minister and had a clever mathematical brain. He came up with what can be described as a way to look at these hidden parts of text and other content and find out how probable it is that a particular inference is true. For example, Twitter is a big part of the Kristen Urbahn story but it is by no means the focus of the events in Pakistan. It was just an (important) means by which information was shared across the globe. Bayes’ clever mathematics is the means by which it is possible for computers to make decisions about the probability that information can be relied on and, in this case, about the role of Twitter in news distribution.

With enough information and generous computing power, of which modern man has plenty, Bayesian probability offers something like a degree of partial belief, rather than a frequency. This allows the application of probability to all sorts of propositions rather than just ones that come with a known structure. "Bayesian" has been used in this sense since about 1950. Advancements in computing technology have allowed scientists from many disciplines to pair traditional Bayesian statistics with other techniques, greatly increasing the use of Bayes’ theorem in science. Now, computers can learn from experience and are beginning to be good at prediction.
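A small worked example shows how a belief is updated. The numbers are invented purely for illustration: suppose we start out thinking there is a 1 in 10 chance that a rumour is true, that a true rumour is confirmed by an independent source 8 times in 10, and that a false one is "confirmed" by mistake 1 time in 10. Bayes' theorem tells us how much to believe the rumour after each confirmation.

```python
from fractions import Fraction


def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' theorem: P(true | evidence) from the prior and two likelihoods."""
    numerator = p_evidence_if_true * prior
    evidence = numerator + p_evidence_if_false * (1 - prior)
    return numerator / evidence


# Assumed figures for this example only.
PRIOR = Fraction(1, 10)                # belief before any confirmation
CONFIRMED_IF_TRUE = Fraction(8, 10)    # chance a true rumour is confirmed
CONFIRMED_IF_FALSE = Fraction(1, 10)   # chance a false rumour is confirmed

after_one = posterior(PRIOR, CONFIRMED_IF_TRUE, CONFIRMED_IF_FALSE)
# Learning from experience: yesterday's posterior is today's prior.
after_two = posterior(after_one, CONFIRMED_IF_TRUE, CONFIRMED_IF_FALSE)

print(after_one)  # 8/17
print(after_two)  # 64/73
```

One confirmation lifts the belief from 1 in 10 to 8 in 17 (a little under half), and a second lifts it to 64 in 73 (nearly nine in ten). Feeding each result back in as the next starting belief is the sense in which such a system learns from experience.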

Twitter was important for the Urbahn story and so, the software might tell us, Twitter will be significant for other stories in the future.

It is such techniques that modern managers need to have to hand, if only to be able to discover emerging trends in communication and/or news and events.

Some fifty years after Bayes’ death, George Boole (1815–1864) came into this world to give us all a great way of discovering information. George (whose wife Mary was an equally brilliant mathematician and the niece of George Everest, the man after whom Mount Everest is named) gave us Boolean algebra (1854). Today most people know it because it is useful when searching for information using search engines. The Boolean operations AND, OR, and NOT help narrow down searches to get closer to the facts we seek (for example, Kristen AND Urbahn OR Forcht).
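The way a search engine can apply these operations is easy to sketch. In the illustration below, which uses four invented documents and is not how any particular search engine is built, each word points to the set of documents containing it. AND then becomes the overlap of two sets, OR their combination, and NOT the removal of one set from another.

```python
# Hypothetical mini-collection of documents, invented for this example.
DOCUMENTS = {
    1: "Kristen Urbahn blogged about Twitter",
    2: "Keith Urbahn posted on Twitter",
    3: "Kristen Forcht wrote a story",
    4: "Obama spoke at the White House",
}


def build_index(documents):
    """Map each lower-cased word to the set of document ids containing it."""
    index = {}
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


INDEX = build_index(DOCUMENTS)


def docs_with(word):
    """Ids of the documents that contain the word (empty set if none)."""
    return INDEX.get(word.lower(), set())


# AND is set intersection, OR is set union, NOT is set difference.
kristen_and_urbahn = docs_with("Kristen") & docs_with("Urbahn")
urbahn_or_forcht = docs_with("Urbahn") | docs_with("Forcht")
urbahn_not_keith = docs_with("Urbahn") - docs_with("Keith")
# Brackets make the grouping explicit: Kristen AND (Urbahn OR Forcht).
kristen_and_either = docs_with("Kristen") & (
    docs_with("Urbahn") | docs_with("Forcht")
)

print(sorted(kristen_and_urbahn))  # [1]
print(sorted(urbahn_or_forcht))    # [1, 2, 3]
print(sorted(urbahn_not_keith))    # [1]
print(sorted(kristen_and_either))  # [1, 3]
```

The last query shows why grouping matters: without brackets, different search engines may read "Kristen AND Urbahn OR Forcht" in different ways, so it is safer to say which pair belongs together.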

But the use of AND, OR, and NOT in mathematics and computing has other applications. When combined with Bayesian probability (and other similar math), it means that computers can be used to make accurate, predictive and related inferences and to learn, for themselves, from the results.

In practice, we find useful tools to give us insights into events.

For example, http://twitris.knoesis.org/ (created at Kno.e.sis at the College of Engineering and Computer Science at Wright State University) provides us with an ontology, related Tweets, links to highly relevant web pages, a chart of Tweet rates and much more.


In practice, a manager can keep a close eye on mentions of a company, brand or product and the reputation drivers behind the Twitter stream.

No one is pretending that business managers need to understand all the technologies. There is a need, however, to know that using such advances is now becoming central to modern management and communication.


Bibliography
Kucera, H. and Francis, W.N. (1967). Computational Analysis of Present-Day American English. Providence: Brown University Press.
Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman (1990). "Indexing by Latent Semantic Analysis". Journal of the American Society for Information Science 41 (6): 391–407
Boole, George (2003) [1854]. An Investigation of the Laws of Thought. Prometheus Books. ISBN 978-1-59102-089-9.
Gruber, Thomas R. (June 1993). "A Translation Approach to Portable Ontology Specifications". Knowledge Acquisition 5 (2): 199–220.


Further Reading:
Introduction to LSA http://lsa.colorado.edu/papers/dp1.LSAintro.pdf
Semantic Inference in the Human-Machine Communication http://www.springerlink.com/content/ju71rcn9pq0wcmy3/
Continuous Semantics to Analyze Real-Time Data http://wiki.knoesis.org/index.php/Continuous_Semantics_to_Analyze_Real_Time_Data
Web semantics and ontology By Johanna Wenny Rahayu http://books.google.com/books?id=K7yFJVu8NDYC


Twitter, Facebook, and dozens more sources come through Gnip's API, normalized and enriched with metadata. http://gnip.com/