Friday, June 03, 2011

The search for hidden meanings

Throughout written history, people have engaged in finding the hidden meaning in writing.

Fascination with the hieroglyphs on the walls of ancient Egyptian temples and burial sites extends back well before 4 PM on November 26, 1922, when Howard Carter’s search for hidden meanings resulted in the discovery of the 3,300-year-old, untouched tomb of the 19-year-old king Tutankhamun.

Today, we are even more fascinated with exploring our written (and spoken) language.

And it all comes down to what is known as Part-of-speech tagging (POS tagging or POST).

Most of us have done it at school by identifying words as nouns, verbs, adjectives, adverbs, etc.

Back when the Beatles were at their peak, America and its allies were embroiled in the Vietnam War, Dr Christiaan Barnard carried out the world's first human heart transplant, the Six-Day War was fought in the Middle East, NASA launched the unmanned Apollo 4 test spacecraft and Britons got their first colour television programmes. But the one development from that same year that affects more people today, and will do in the future, is the work of Henry Kucera and W. Nelson Francis. They published their classic work Computational Analysis of Present-Day American English (1967), which provided a basic analysis of the words in the body of texts known today simply as the Brown Corpus.

Henry Kucera and W. Nelson Francis did more complicated analysis than getting computers to find nouns and verbs, but the principle is the same. Part-of-speech tagging is a process largely based on each word's relationships with adjacent and related words in a phrase, sentence, or paragraph.

Once performed by hand, POS tagging is now done within computational linguistics, the field that grew up around work such as the Brown Corpus. It uses algorithms that associate discrete terms, as well as hidden parts of speech, with a set of descriptive tags. Those tags are either defined in advance or, more recently, created ‘on the fly’ as they are found.
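To make the idea of tagging by looking at neighbouring words concrete, here is a minimal sketch in Python. It is a toy, not the method Kucera and Francis used: the tiny word list, the tag names (DET, NOUN, VERB and so on) and the two context rules are all assumptions made up for illustration.

```python
# Toy lexicon: each word maps to the tags it may take (hypothetical data).
LEXICON = {
    "the": {"DET"}, "a": {"DET"},
    "they": {"PRON"}, "we": {"PRON"},
    "search": {"NOUN", "VERB"},   # ambiguous without context
    "found": {"VERB"},
    "tomb": {"NOUN"},
    "hidden": {"ADJ"},
}


def tag(sentence):
    """Tag each word, using the previous tag to resolve ambiguous words."""
    words = sentence.lower().split()
    tags = []
    for word in words:
        options = LEXICON.get(word, {"NOUN"})  # unknown words default to NOUN
        if len(options) == 1:
            tags.append(next(iter(options)))
            continue
        previous = tags[-1] if tags else None
        if previous in ("DET", "ADJ") and "NOUN" in options:
            tags.append("NOUN")   # "the search" -> noun
        elif previous == "PRON" and "VERB" in options:
            tags.append("VERB")   # "they search" -> verb
        else:
            tags.append(sorted(options)[0])  # arbitrary but repeatable fallback
    return list(zip(words, tags))


print(tag("The search found the hidden tomb"))
print(tag("They search the tomb"))
```

The same word, "search", comes out as a noun in the first sentence and a verb in the second, purely because of the word that precedes it. Real taggers use far larger tag sets and statistics learned from corpora rather than two hand-written rules.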

The reason that Kucera and Francis's work is so important is that we have built a whole new form of society on this idea.

Clever scientists have used this idea of extracting hidden meaning to develop a new form of internet.

One of these ideas came from five researchers: Scott Deerwester, Susan Dumais, George Furnas, Thomas Landauer and Richard Harshman (1990). They outlined how to analyse relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. Called Latent Semantic Analysis (LSA), the idea assumes that words that are close in meaning will occur in similar pieces of text.
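A rough sketch of the LSA idea, in plain Python, may help. It builds a term-by-document count matrix from four made-up one-line "documents", keeps only the two strongest singular directions (the "concepts"), and compares words in that reduced space. The documents, the function names and the use of simple power iteration to obtain the truncated SVD are all assumptions for illustration; real systems use tuned numerical libraries and weighting schemes.

```python
import math


def term_document_matrix(docs):
    """Rows are terms, columns are documents, cells are raw counts."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    matrix = [[doc.split().count(term) for doc in docs] for term in vocab]
    return vocab, matrix


def truncated_svd(A, k, iters=300):
    """Top-k singular triplets via power iteration with deflation on A^T A."""
    m, n = len(A), len(A[0])
    G = [[sum(A[r][i] * A[r][j] for r in range(m)) for j in range(n)]
         for i in range(n)]
    triplets = []
    for _ in range(k):
        v = [1.0 + 0.1 * i for i in range(n)]  # fixed start keeps runs repeatable
        for _ in range(iters):
            w = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * G[i][j] * v[j] for i in range(n) for j in range(n))
        sigma = math.sqrt(lam)
        u = [sum(A[r][j] * v[j] for j in range(n)) / sigma for r in range(m)]
        triplets.append((sigma, u, v))
        # Deflate: remove the direction just found before looking for the next.
        G = [[G[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return triplets


def term_vectors(vocab, triplets):
    """Each term becomes a short vector of concept weights."""
    return {term: [sigma * u[r] for sigma, u, _ in triplets]
            for r, term in enumerate(vocab)}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


# Hypothetical mini-corpus: two lines about the raid, two about Twitter.
DOCS = [
    "bin laden raid pakistan",
    "raid pakistan compound",
    "twitter tweet news",
    "twitter tweet news users",
]
vocab, matrix = term_document_matrix(DOCS)
vectors = term_vectors(vocab, truncated_svd(matrix, k=2))

print(cosine(vectors["laden"], vectors["compound"]))  # close to 1
print(cosine(vectors["laden"], vectors["twitter"]))   # close to 0
```

Note that "laden" and "compound" never appear in the same line, yet they end up close together because both keep company with "raid" and "pakistan". That is the kind of hidden relationship LSA is meant to surface.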

This idea is used by all manner of analysis programmes and helps find those hidden meanings.

In their paper they say “...Thus while LSA’s potential knowledge is surely imperfect, we believe it can offer a close enough approximation to people’s knowledge to underwrite theories and tests of theories of cognition.” Since 1990, academics have come a long way and accuracy is getting ever closer to social reality.

Today, the use of semantics makes the Google and Bing web search algorithms more accurate, helps newspaper journalists find the most authoritative sources for information and informs the top companies about events and their drivers to optimise financial, marketing and communication decisions.

Remember Kristen Urbahn’s story I blogged about three weeks ago? It has lots of hidden meanings. Using’s special search engine Kristen can find out about the relationships between different parts of the story (using automated Part of Speech tagging).

The results show the nature of some of the significant words:

Kathy Griffin (4)
  her
Donald Rumsfeld
Obama (3)
  Obama
  Obama
  his
Brian Williams
Dan Pfeiffer
Jill Jackson (2)
  Jill Jackson
Keith (6)
  Keith Urbahn
Kristen Urbahn (13)
  Kristen Urbahn
  Kristen Urbahn
Maggie Fox
Osama Bin Laden (6)
  Osama Bin Laden
  Bin Laden
Osama Bin Ladin (5)
  Osama Bin Ladin
Sohaib Athar

GPE (13)
  COUNTRY (5)
    Afghanistan
    Pakistan (2)
      Pakistan
    US (2)
  CITY (4)
    Abbottabad
    Denver
    Guardian
    San Francisco
  US STATE (4)
    Kansas
    South Carolina
    Washington (2)
      Washington
      Washington
  Wiltshire

MEDIA ORG (7)
  TV NETWORK (5)
    BBC
    CBS
    CNN (2)
    NBC
  New York Times
  Washington Times
Defence
Google
Social Media Group
Twitter (5)
  Twitter
Al Qaeda
Republican Leaders Office
Preston University
University of Kentucky

URL (1)
  HTTP (1)

OTHER (18)
  White House (4)
    White House Communication Director
    White House
  Capitol Hill
  Christian
  Creative Commons
  Dachshunds
  Internet
  Internet
  Mobile
  President Obama
  Royal Wedding
  The New York Times
  Facebook (3)

DATE (2)
  Aug. 18, 2009
  May 1 2011
  1 May
  May
  months ago
  the evening
YEAR (2)
  2006
  2011

TIME (8)
  10:30 p.m. Eastern Time
  10:40 p.m.
  11 p.m.
  4pm EST
  9:45 p.m.
  from 10:45 p.m.-2:20 a.m.
  Five years
  the hours

NUMBER (11)
  2.0
  3,000
  5,000
  7.24
  millions
  more than 185
  one
  one
  six
  three
  two

Here, then, are the key elements that can be extracted from the blog post.

Two people from the 18th and 19th centuries now star in this story.

Thomas Bayes (1702–1761) was the son of a London Presbyterian minister and had a clever mathematical brain. He came up with what can be described as a way to look at these hidden parts of text and other content and find out how likely it is that a particular inference is true or not. For example, Twitter is a big part of the Kristen Urbahn story, but it is by no means the focus of the events in Pakistan. It was just an (important) means by which information was shared across the globe. Thomas’s clever mathematics is the means by which it is possible for computers to make decisions about the probability that information can be relied on and, in this case, about the role of Twitter in news distribution.

With enough information and generous computing power, of which modern man has plenty, Bayesian probability treats a probability as something like a degree of partial belief, rather than a frequency. This allows probability to be applied to all sorts of propositions, rather than just ones that come with a known structure. "Bayesian" has been used in this sense since about 1950. Advances in computing technology have allowed scientists from many disciplines to pair traditional Bayesian statistics with other techniques, greatly increasing the use of Bayes' theorem in science. Now computers can learn from experience and are beginning to be good at prediction.

Twitter was important for the Urbahn story and so, the software might tell us, Twitter will be significant for other stories in the future.
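Here is a small sketch of how that kind of belief could be updated with Bayes' theorem. The numbers are invented for illustration only: a starting belief of 50% that Twitter is a significant channel for breaking news, an assumed 80% chance that a big story surfaces on Twitter first if that is true, and an assumed 30% chance if it is not.

```python
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """Return P(hypothesis | evidence) using Bayes' theorem."""
    numerator = p_evidence_if_true * prior
    evidence = numerator + p_evidence_if_false * (1.0 - prior)
    return numerator / evidence


# Hypothetical figures, not measurements.
belief = 0.5
after_one_story = bayes_update(belief, 0.8, 0.3)
after_two_stories = bayes_update(after_one_story, 0.8, 0.3)

print(round(after_one_story, 3))    # 0.727
print(round(after_two_stories, 3))  # 0.877
```

Each story that breaks on Twitter nudges the belief upwards, which is the sense in which the software "learns from experience". A story that did not surface on Twitter would be handled the same way, using the complementary probabilities, and would pull the belief back down.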

It is such techniques that modern managers need to hand, if only to be able to discover emerging trends in communication, news and events.

Just over fifty years after Thomas's death, George Boole (1815–1864) came into this world to give us all a great way of discovering information. George (who was married to the equally mathematically brilliant Mary, the niece of the man who gave Mount Everest its name) gave us Boolean algebra (1854). Today most people know it because it is useful when searching for information using search engines. The Boolean operations AND, OR, and NOT help narrow down searches to get closer to the facts we seek (Kristen AND Urbahn OR Forcht).
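A minimal sketch of how AND, OR and NOT work in a search, using Python sets. The four one-line documents are invented for illustration, and the query is read as (Kristen AND Urbahn) OR Forcht, on the assumption that AND binds more tightly than OR; search engines differ on this, so brackets are the safe choice.

```python
def build_index(documents):
    """Map each lower-cased word to the set of document ids containing it."""
    index = {}
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


# Hypothetical documents.
DOCUMENTS = {
    1: "Kristen Urbahn wrote about Twitter",
    2: "Keith Urbahn tweeted the news",
    3: "Kristen Forcht joined the group",
    4: "Twitter spread the news",
}
INDEX = build_index(DOCUMENTS)
ALL_IDS = set(DOCUMENTS)


def docs_with(word):
    return INDEX.get(word.lower(), set())


# AND is set intersection (&), OR is union (|), NOT is a difference from everything.
kristen_query = (docs_with("Kristen") & docs_with("Urbahn")) | docs_with("Forcht")
urbahn_not_twitter = docs_with("Urbahn") & (ALL_IDS - docs_with("Twitter"))

print(sorted(kristen_query))       # [1, 3]
print(sorted(urbahn_not_twitter))  # [2]
```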

But the use of AND, OR, and NOT in mathematics and computing has other applications. When combined with Bayesian probability (and other similar maths), it means that computers can be used to make accurate, predictive and related inferences and to learn, for themselves, from the results.
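One simple way to see Boolean facts and Bayesian probability working together is a naive Bayes classifier over yes/no features (is this word present or not?). The sketch below is an assumption-laden toy: the four training examples, the two labels and the add-one smoothing are chosen purely for illustration, and it is not a description of how any particular product works.

```python
import math
from collections import defaultdict


def train(examples):
    """examples: list of (set_of_words, label). Counts word presence per label."""
    label_counts = defaultdict(int)
    word_counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for words, label in examples:
        label_counts[label] += 1
        for word in words:
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(model, words):
    """Return a probability for each label given which words are present."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        log_p = math.log(n / total)
        for word in vocab:
            # Add-one smoothing so unseen words never give a zero probability.
            p_present = (word_counts[label][word] + 1) / (n + 2)
            log_p += math.log(p_present if word in words else 1 - p_present)
        scores[label] = log_p
    top = max(scores.values())
    weights = {label: math.exp(s - top) for label, s in scores.items()}
    norm = sum(weights.values())
    return {label: w / norm for label, w in weights.items()}


# Hypothetical training data.
EXAMPLES = [
    ({"twitter", "breaking", "raid"}, "news"),
    ({"tweet", "breaking", "president"}, "news"),
    ({"recipe", "cake", "twitter"}, "other"),
    ({"holiday", "photos", "cake"}, "other"),
]
model = train(EXAMPLES)
result = predict(model, {"breaking", "raid", "tweet"})
print(result)
```

Feeding the classifier more labelled examples changes the counts and therefore the probabilities, which is the modest sense in which such a program learns from results.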

In practice, we find useful tools to give us insights into events.

For example  (created at Kno.e.sis at the College of Engineering and Computer Science at Wright State University) provides us with an ontology, related Tweets, links to highly relevant web pages, a chart of Tweet rates and much more.

In practice, a manager can keep a close eye on mentions of a company, brand or product and the reputation drivers behind the Twitter stream.

No one is pretending that business managers need to understand all the technologies. There is a need, however, to know that using such advances is now becoming central to modern management and communication.

Kucera, H. and Francis, W.N. (1967). Computational Analysis of Present-Day American English. Providence: Brown University Press.
Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman (1990). "Indexing by Latent Semantic Analysis". Journal of the American Society for Information Science 41 (6): 391–407.
Boole, George (2003) [1854]. An Investigation of the Laws of Thought. Prometheus Books. ISBN 978-1-59102-089-9.
Gruber, Thomas R. (June 1993). "A Translation Approach to Portable Ontology Specifications". Knowledge Acquisition 5 (2): 199–220.

Further Reading:
Introduction to LSA
Semantic Inference in the Human-Machine Communication
Continuous Semantics to Analyze Real-Time Data
Web semantics and ontology By Johanna Wenny Rahayu
