At the end of August I spent a week in Bilbao, sponsored by the MyMedia project on which I worked when I was still at my old employer Novay, thanks to Christian Wartena, my co-author and the driving force behind the papers I presented. You can skip down if you just want to read about Bilbao and the Guggenheim, but I was there for the TIR 2010 workshop on text-based information retrieval. The workshop was part of DEXA2010, itself part of a group of related conferences (DAWAK/EC-web/TrustBus/Globe/EGOVIS/ITBAM…) going on simultaneously and held at Deusto University in Bilbao. In practice, this just meant that there were many different simultaneous tracks to choose from. Some of the most interesting talks were in fact in tracks of the other conferences. The general theme is databases, expert systems, data management, e-commerce and data manipulation, but it has branched out considerably, to the point where recommender systems, the semantic web, bioinformatics and Grid/Cloud computing have become important subjects.
TIR workshop
Text-based information retrieval was a bit of an outsider at the conference too, but arguably the TIR workshop was just one more conference. In my opinion at least, the papers were on par with those of the rest of the conferences. It was more lively though, as befits a workshop, and, stimulated by the organizers, there was plenty of opportunity to ask questions during and after the talks. Automatic text analysis is a difficult and vast field where results are seldom conclusive. The papers therefore varied a lot, but a few things stood out in the program:
• David Coquil’s talk on determining user interest from query logs by clustering the logs based on their Wordnet distance (Liman, L., Coquil, D., Kosch, H. Brunie, L. Extracting user interests from search query logs: a clustering approach TIR 2010 Bilbao)
• Elisabeth Lex’s talk on classifying blogs by genre and sentiment (Lex, E., Juffinger, A., Granitzer, M., Comparison of stylometric and lexical features for web genre classification and emotion classification in blogs, TIR 2010 Bilbao)
• Benno Stein’s talk on the way search queries can be refined to what people are really looking for using co-occurrence data (Hagen, M., Stein, B., A Heuristic Search Strategy to Improve Web Queries TIR 2010 Bilbao)
• Leonard Hennig’s talk on finding sentence level content units using co-occurrence and latent Dirichlet allocation. (Hennig L., Strecker, T., Narr, S., de Luca, E.W., Albayrak, S. Identifying sentence level Semantic Content Units with Topic Models TIR 2010 Bilbao)
I myself presented two papers on the problem of extracting keywords from text using two diametrically opposed methods. In both cases the real problem is to find a measure of how relevant a word is as a keyword. There are two components to that relevance: the relevance of the word for the document and the relevance of the word in the rest of the world. Relevance of the word for the document is easy: just count the number of occurrences (basically). The methods differ in the way they take the world into account.
The first method follows PhD work by Luit Gazendam and is based on the assumption that you have a large structured thesaurus to choose your words from. For professional archiving this is usually obligatory, and it drastically reduces the amount of noise compared to user-selected tags. Such thesauri are not just lists of words, however, but come with thesaurus relations that indicate whether terms are synonyms, whether one term is broader or narrower than another, or whether words are semantically related in other ways. The idea is then to use these thesaurus relations as a proxy for the relevance of the word in the world and count the number of relations (possibly indirect) that exist between words. After suitable normalization, we use that count as a measure of the word's relevance. The important point is that in this way you can extract keywords from an isolated document: unlike standard information retrieval measures like tf-idf, you do not need a large document collection to compute it. In fact it seems to work about as well as tf-idf (that is to say, surprisingly well but not stellar). Determining that is actually the most difficult part of the paper: you need enough documents that are annotated by professional annotators to serve as a gold standard to compare with the keywords that your algorithm gives. That is not so easy to get hold of, but fortunately Luit spent part of his PhD time at Sound and Vision, where they do that kind of thing. Unfortunately this so-called gold standard is actually more like modern finance and shaky in itself: annotators only agree among themselves in about 60% of the cases. This is far, far better than a random choice (there are 30000 words to choose from), and both sets of keywords tend to make sense, but it makes it a lot harder to evaluate algorithms (although, in the spirit of the Dutch philosopher Cruijff, it is also easier to do about as well as humans).
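To make the idea of counting thesaurus relations a bit more concrete, here is a minimal sketch in Python. It is not the algorithm from the paper: the toy thesaurus, the function names (`terms_within`, `rank_terms`), the cut-off of two relations and the normalization by the number of other candidate terms are all assumptions made up for illustration.

```python
from collections import deque

# Hypothetical toy thesaurus: term -> directly related terms
# (broader, narrower and "related" links are all treated alike here).
THESAURUS = {
    "music": {"jazz", "concert"},
    "jazz": {"music", "saxophone"},
    "saxophone": {"jazz"},
    "concert": {"music"},
    "football": {"sport"},
    "sport": {"football"},
}


def terms_within(term, max_hops):
    """Return the thesaurus terms reachable from `term` via at most `max_hops` relations."""
    seen = {term}
    queue = deque([(term, 0)])
    while queue:
        current, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not follow relations beyond the cut-off
        for neighbour in THESAURUS.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    seen.discard(term)
    return seen


def rank_terms(document_terms, max_hops=2):
    """Rank the thesaurus terms found in one document by how many of the
    other terms found in that document they are (indirectly) related to."""
    candidates = set(document_terms) & set(THESAURUS)
    scores = {}
    for term in candidates:
        related = terms_within(term, max_hops) & candidates
        # Assumed normalization: fraction of the other candidate terms reached.
        scores[term] = len(related) / max(len(candidates) - 1, 1)
    # Highest score first; ties broken alphabetically to keep the output stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_terms(["jazz", "saxophone", "concert", "football"])
```

In this toy document "jazz" ends up on top because it is related to both "saxophone" and "concert", while "football" is connected to nothing else in the document and scores zero. Note that no document collection is needed, only the thesaurus.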
The other method is based on research by Christian Wartena and myself. The problem is again to determine the keywords of a document, but this time we work under the assumption that we do have a large and representative corpus of documents. We then use a central notion in our joint work: the co-occurrence distribution of words. This is a probability distribution over words that counts how often a word co-occurs with another word in a document. If you assume that documents reflect the state of the world, then this co-occurrence distribution is a proxy for the meaning of a word. While this is a very crude way to think of the meaning of words, it is an instance of the distributional hypothesis of Harris, which simply assumes that words co-occur in a document because there is a good semantic reason in the real world. Co-occurrence of words is thus like a platonic shadow of the real semantics. It is also very computable and simple, and moreover it allows you to operationalize intuitive notions like semantic proximity with computable information-theoretic proximity measures like the Jensen-Shannon divergence or statistical measures like correlation coefficients. However, this clearly only has a chance of working if you start with enough statistics, that is, if you have a large document collection. Again the main problem turns out to be a good evaluation (which involves generating multiword keywords, arguably the weakest link in the whole setup but one that plagues all methods we evaluated). Another problem is that it works less well than we hoped: you get better results than with tf-idf, but not all that much better, and at considerable cost in computation time and complexity (Gazendam, L., Wartena, C., Brussee, R., Thesaurus based term ranking for keyword extraction, and Wartena, C., Brussee, R., Slakhorst, W., Keyword extraction using word co-occurrence, TIR 2010 Bilbao).
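For the curious, a co-occurrence distribution and the Jensen-Shannon divergence fit in a few lines of Python. This is a deliberately simplified sketch, not the definition used in our papers: it assumes whitespace tokenization, counts plain co-occurrence within a document without any weighting, and the function names and the three-document "corpus" are invented for the example.

```python
import math
from collections import Counter


def cooccurrence_distribution(word, documents):
    """Probability distribution over the words that share a document with `word`."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()  # assumption: naive whitespace tokenization
        if word in tokens:
            counts.update(t for t in tokens if t != word)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions,
    1 for distributions that share no words at all."""
    def kl_to_mixture(a, b):
        # Kullback-Leibler divergence from `a` to the mixture (a + b) / 2.
        return sum(
            pa * math.log2(pa / (0.5 * pa + 0.5 * b.get(w, 0.0)))
            for w, pa in a.items()
            if pa > 0
        )
    return 0.5 * kl_to_mixture(p, q) + 0.5 * kl_to_mixture(q, p)


corpus = ["cat purrs softly", "kitten purrs softly", "stock market falls"]
p_cat = cooccurrence_distribution("cat", corpus)
p_kitten = cooccurrence_distribution("kitten", corpus)
p_stock = cooccurrence_distribution("stock", corpus)

same = jensen_shannon_divergence(p_cat, p_kitten)    # identical contexts
different = jensen_shannon_divergence(p_cat, p_stock)  # disjoint contexts
```

Here "cat" and "kitten" never appear in the same document, yet their co-occurrence distributions are identical, so their divergence is zero, whereas "cat" and "stock" come out maximally far apart. With three documents this is of course a caricature; the whole point of the method is that it needs a large corpus.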

Rest of the conference
The invited speakers for the conference varied too. I heard the well-known computer science professor Baeza-Yates, now head of Yahoo research, speaking on the economics and algorithms of web advertising, which gives rise to rather interesting game-theoretic problems for pricing (of the Nash and von Neumann-Morgenstern type rather than the Doom type) and practical Turing tests to filter out robots acting like people and people acting like robots to manipulate the price of a click or just make a few bucks. I also heard the classic database professor Oscar Pastor speaking on the need for an ontology to integrate the vast genomic databases by creating a common vocabulary, paralleling the DICE proposal.
There was also a Dutch civil servant from the immigration office who spoke about the largely automated process for judging immigration applications, of which they are apparently quite proud. I did not go there (for fear of getting angry, and because there were other interesting talks), so I should give people the benefit of the doubt, but judging from the slides somebody decided that after the political minefield that is immigration politics exploded in the Verdonk Hirsi-Ali debacle, they needed the ultimate answer to political responsibility: we did not do it, the computer applied the rules. Geert Wilders will be very pleased that the system is flexible, and that a “business rule” “muslim NJET” should be easy to implement.

As expected, the regular sessions were more technical, some delving into parallel query optimization or power management of disk drives. I must admit that, not being a database expert, I avoided most of the more technical ones, although I had my share of hardcore semantic web stuff presented by barely comprehensible Chinese graduate students. One cannot overemphasize the importance of solid, highly technical papers, but there were simply plenty of other talks to go to. As far as I am concerned, some of the highlights of the conference were the following:
• The talk by Jonathan Gemell on tag based recommendation and clever ways to mix it with collaborative filtering (Gemell, J. Schimoler, T. Mobasher, B., Burke, R. resource recommendation in collaborative tagging applications EC-web 2010 Bilbao),
• The talk by Mathias Bank on text mining social networks to understand what people think about cars (he works for Mercedes Benz) (Franke, J., Bank, M., Social networks as Data Source for Recommendation Systems EC-web 2010 ),
• Hendrick Decker’s talk on ways to deal with violated constraints in databases: one should think of them as violated assumptions, determine why the assumptions are violated and which part of the database is unaffected. You can say sensible things about all of this by tracking the combination of constraints that are triggered, that is, the causes of the violation. (Hendrick Decker, Basic causes for the inconsistency tolerance of query answering and integrity checking, FlexDBIST 2010 Bilbao)
• Bart Knijnenburg’s talk on the way that recommender systems affect people’s appreciation of what they get recommended. Simply put, if you recommend movies, people rate them differently. For example, if you get recommended two movies and you like them both, but one more than the other, you tend not to rate them in the same way. This is bad news for designers of recommenders who try to calibrate their recommendations by training on a large database of rated items. People also like a bit of diversity, so if you only ever recommend them Star Trek episodes they get bored (replace with GTST if necessary). In the end people do value the recommender though. I already knew Bart from the MyMedia project and we had some beers together. (Knijnenburg, B., Willemsen, M., Hirtbach, S., Receiving Recommendations and Providing Feedback: The User Experience of a Recommender System, EC-Web 2010)
• Maciej Dabrowski’s talk on the use of Pareto optimality in recommender systems. I had met Maciej a few days earlier at the conference dinner, where we had already discussed a few things and drunk some wine together (Dabrowski, M., Acton, T., Comparing techniques for preference relaxation: a decision theory perspective, EC-web 2010)
• Ilham Esslimani’s talk on using social networks to find “opinion leaders” whose taste is followed by others. (Esslimani, I. , Brun, A. , Boyer, A., Detecting leaders to alleviate latency in recommendations EC-web 2010).
Bilbao and the Guggenheim museum

Bilbao is an interesting city. It is located on the northern coast, in the green part of Spain, fairly close to the French border, along a deep inland estuary, the Nervion. In fact the mountains around Bilbao are among the wettest parts of Europe, which gives the area a mix of northern and southern atmosphere. It also lies along the old pilgrim route to Santiago de Compostela, which links it to the rest of Europe. In reality, however, Bilbao does not want to be linked to the rest of Europe but to be independent and different from anything else.

From antiquity on, the region was known for its harbor, iron ore and its wild Basques. The Basques are the last remains of Europe’s pre-Indo-European population speaking a language unrelated to any other language in the world.

Traditionally, they are a raucous bunch, the last people on the Iberian peninsula to be conquered by the Romans, long keeping both the Arabs and Charlemagne at bay, fishing for cod on the high seas, and involved in a permanent struggle with Spain’s central authority.
The iron ore and the harbor resulted in a 19th century boom, when the city became a centre of shipbuilding, steel mills and other heavy iron industry. In the same period, it grew out of its medieval centre with a new quarter in glorious English Victorian and French Empire style.

Fast forward to the 20th century: after Franco’s death Spain becomes a member of the EU and as a result the heavy industry collapses. Unemployment rises to 30%. So what do they do? They decide to tear down all the heavy industry and rebuild the area with university buildings and a business centre, but above all they decide to spend 127.5 million dollars on a museum: the Guggenheim Bilbao. It worked. Bilbao is now known for the Guggenheim rather than for steel, and the city, while still looking relatively poor, seems to be lively and optimistic, with lots of places to eat tapas and lots of bookstores, theaters and shops.

The Guggenheim Bilbao is above all an outrageous building. It was designed by Frank Gehry and is widely considered to be one of his best buildings. It obviously inspired some of his other designs, like the Walt Disney Concert Hall in Los Angeles, and its curvy shapes are so complex that they had to adapt CAD-CAM software from the aerospace industry to actually build it.

Its lush matt titanium coating makes it look like a giant stack of fishes, softly glittering in the sun, reflecting the limestone buildings and the bright red bridge that are integrated in its structure, and diffracting the colours of the sunlight in the evening.

Overlooking the riverlike estuary Nervion, it is reflected in the water, sometimes shrouded in artificially created fog.

Once inside, you enter an enormous high rising hall, all curvy, lit by sunlight that pours in from above and through a glass front under a giant titanium mansard that gives access to a terrace on the riverside. In all it is spectacularly beautiful, and lives up to its reputation as a timeless masterpiece.

The actual collection is mostly dedicated to avant-garde art. I really liked the cavernous wing passing under the bridge, containing The Matter of Time by Richard Serra, which is somehow in line with the exuberance of the rest of the building. The work consists of very large labyrinthine structures made from reddish corroded steel plates, in reference to Bilbao’s iron past.

There was also a large installation of LED columns streaming words in Spanish and Basque (on death, as it was designed for an AIDS fundraiser). I secretly tried to make a video because the image fitted rather nicely with my talk the next day, but alas, I had to stop after a little while because the overseer came in.

The rest of the comparatively small permanent collection varied from a large Warhol, with rows and rows of faces of Marilyn Monroe that seemed to be scratched into a black surface over a multi-coloured background, to sculptures by local Basque artists.
The exhibition of Anish Kapoor was interesting: reflecting curved mirrors, forms made of intensely coloured red, yellow and black powder, holes in boxes that revealed a pitch black interior (the standard model for a black body radiator, by the way), a giant clock that scratched in a mountain of red wax and a cannon that shot blobs of red wax at a wall (I do have a movie of that one!). Yet at times I could not help laughing at the reverence of the audio tour and the carefully constructed stories of the artist. Obviously a museum like this has to think carefully about marketing and the Guggenheim “brand”.

I also visited the local art museum which had both a nice collection of 16th century Flemish paintings and a good collection of modern art.
The last afternoon we spent on the unofficial DEXA2010 beach expedition, and managed to reach the sea at Plentzia. We spotted a suntanned surfer who obviously enjoyed life in different ways, which showed I could have spent my week differently, but I had no regrets.

tagged with: wilders, ec-web2010, dexa2010, tir2010, automatic text analysis, information retrieval, guggenheim, bilbao, conference
Comments
Nice pictures! Looks like the conference was a success!