I could spend a lot of time writing a definition of semantics or the semantic web. I could show how Tim Berners-Lee, the inventor of the web, finds it all-absorbing, why Google thinks it is essential to its future survival, and why some serious thinkers consider it important for the future of society.
It's much more fun to put on a practical demonstration. That is what I am going to do.
The demonstration will seek to show that it is possible to identify, at a moment in time, the key semantic notions that define a genre and the individuals in that genre.
The methodology I shall apply is set out in this post, and I shall also point to the tools that allow practitioners and researchers to replicate the findings.
To ensure that this is a relevant case study, I shall take as my example a major set of competing public relations campaigns: the UK General Election. Specifically, I shall look at the semantic similarities and differences between the three leaders: David Cameron, Conservative; Gordon Brown, Labour; and Nick Clegg, LibDem.
This is a big project and we are limited (by the technological challenge I face) to sampling the corpus. In the future we will not have to be limited by such constraints.
The methodology I am able to use is as follows.
- Every 40 minutes I shall use an automated bot to interrogate the internet to identify new web pages published that day which mention each of the three major party leaders. I anticipate this will be of the order of 200,000 to 300,000 pages every day (or more). Of these I will select 1000 pages (citations) on the basis of number of views and mentions of the leaders in the headline and first paragraph. This content will include publicly available items such as: news media pages in online newspapers, magazines and other news outlets (offering news that is not hidden behind robot blocks and paywalls); blog posts, Twitter tweets, Social Network contributions, wiki pages, Bulletin Boards, discussion lists, List Serve, Sidewikis, comments about photographs and videos, slideshows and other web based pages.
- Each of these selected citations will be parsed (software available here) to extract the contiguous text, which will be retained for further analysis together with an audit trail giving the date found and the URL.
- Each citation will then be parsed using latent semantic indexing software which will identify the semantic concepts in each citation (here is software that you can use to extract concepts from web pages).
- I will then rank the concepts in order of frequency of use in the citations for each day. This will provide a rather boring list of words and their daily count.
- To make it easy to see the result and to compare the three Party Leaders, I will use a wordwall for visualisation purposes so that you can compare the most significant semantic concepts for each of the three selected leaders.
- These will be posted on this blog every day until polling day.
- This is a proof of concept demonstration showing the semantic differences between the three competitors.
- It will show how the web reports the three campaigns, using a sample of online content selected for its reach and readership.
- The analysis will show how these citations represent an online view of the competitors' similarities and differences.
- It will show how all manner of online influences can represent the three candidates.
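
For readers who want to experiment, the selection and ranking steps above can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions, not the software I am using: the `Citation` record, the `prominence` score (views plus a bonus of 1000 per leader mention in the headline or first paragraph) and the small stopword list are all hypothetical. In particular, `extract_concepts` is a plain word tokeniser standing in for the latent semantic indexing step, which it does not implement.

```python
import re
from collections import Counter
from dataclasses import dataclass

LEADERS = ("David Cameron", "Gordon Brown", "Nick Clegg")

# Tiny illustrative stopword list (an assumption, not taken from any real tool).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "is", "for", "with"}


@dataclass
class Citation:
    """One selected page plus its audit trail (hypothetical record layout)."""
    url: str
    found: str            # date found, e.g. "2010-04-12"
    views: int
    headline: str
    first_paragraph: str
    text: str


def prominence(c: Citation) -> int:
    """Hypothetical score: views plus a bonus per leader mention up front."""
    lead = (c.headline + " " + c.first_paragraph).lower()
    mentions = sum(lead.count(name.lower()) for name in LEADERS)
    return c.views + 1000 * mentions


def select_top(citations, n=1000):
    """Keep the n highest-scoring citations for the day."""
    return sorted(citations, key=prominence, reverse=True)[:n]


def extract_concepts(text):
    """Stand-in for concept extraction: lowercase words minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def daily_concept_counts(citations):
    """Map each leader to a Counter of concepts from pages that mention them."""
    counts = {name: Counter() for name in LEADERS}
    for c in citations:
        concepts = extract_concepts(c.text)
        body = c.text.lower()
        for name in LEADERS:
            if name.lower() in body:
                counts[name].update(concepts)
    return counts
```

Calling `daily_concept_counts(select_top(pages))["Nick Clegg"].most_common(20)` on a day's list of `Citation` records would then give the "rather boring list of words and their daily count" that feeds the wordwall for one leader.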