BagPack: A bag-of-words approach for representing relations

I have just submitted my first paper on corpus-based semantics. Together with Marco Baroni, I propose a general method for building a feature space for word pairs that represents the relations between them. We call it BagPack: Bag-of-words representation of Paired concept knowledge. As the name suggests, it is a bag-of-words approach: it keeps track of the contexts where the two words co-occur as well as the individual occurrences of each word.

Basically, BagPack provides a method for building a bag-of-words, vector-based representation for an arbitrary pair of words. Generalizing the representation and processing of relations between word pairs is not an original idea of ours. Peter Turney has already proposed such an approach, based on features extracted from short connector patterns in which the two words co-occur. He obtains results close to the state of the art on SAT analogy questions and reasonably high scores on TOEFL synonym questions, and he presents two new tasks that demonstrate the generality of his approach: distinguishing synonyms from antonyms, and identifying similarity and association separately.

In BagPack, the feature space in which we construct our vectors is also general enough to tackle different tasks involving relations between words. Once the vectors for a set of pairs are constructed, you can use them in different tasks without consulting the corpus again. Additionally, since BagPack also represents the contexts where the words occur on their own, it provides a fallback option in case the words in question are never observed co-occurring. We can think of two cases where this fallback is helpful:

Firstly, your corpus may be small (or the words you are looking at may be very rare). For some tasks, using a larger corpus may be the solution, but if that is not possible then a fallback mechanism is necessary to get reasonable results. Secondly, there are semantic tasks in which you will run into word pairs that never co-occur, no matter how large your corpus is. Estimating the selectional preferences of verb-noun pairs is such a task. Here we may need to find out that apples are more likely to be eaten than to eat. However, the productivity of language makes this task harder, because it is not enough to look at the contexts where the verb and the noun occur together. We would also like to infer that ink is likely neither to be eaten nor to eat, whereas shabu shabu, a Japanese dish, is eaten, even if neither noun is ever observed near the verb eat. If pushed, I would even argue that employing a larger corpus is not a permanent solution to this problem.

For an ordered word pair, whose members we call the target words, we construct three sub-vectors: two for the individual occurrences of each target and one for the co-occurrences of the two words. The concatenation of the three sub-vectors is the final vector that represents the pair. To build the three sub-vectors, we identify a number of frequent words as the basis terms, and for each basis term we construct a small set of features that encode the position of the basis term relative to the target words and the order of the target words in the context. We test our model on SAT analogy questions, TOEFL synonym questions, a selectional preference task constructed by Padó, and a set of common-sense assertions involving 5 relations, constructed from ConceptNet, a semantic network of common-sense knowledge. Our choice of machine learning algorithm is C-SVM for classification and $latex \epsilon$-SVM for regression.
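To make the construction concrete, here is a minimal sketch in Python. It is not the code used in the paper. The exact feature set is an assumption made for illustration: left/right position for the single-word sub-vectors, and before/between/after position combined with target order for the co-occurrence sub-vector. The function names, the use of sentences as contexts, and the toy corpus are all hypothetical.

```python
from collections import Counter

def single_features(sentences, target, basis):
    """Count basis terms to the left/right of `target` (assumed feature set)."""
    counts = Counter()
    for tokens in sentences:
        if target not in tokens:
            continue
        pos = tokens.index(target)  # first occurrence only, for simplicity
        for i, tok in enumerate(tokens):
            if tok in basis and i != pos:
                counts[(tok, "L" if i < pos else "R")] += 1
    return [counts[(b, side)] for b in basis for side in ("L", "R")]

def pair_features(sentences, w1, w2, basis):
    """Count basis terms in contexts containing both targets (assumed feature set)."""
    counts = Counter()
    for tokens in sentences:
        if w1 not in tokens or w2 not in tokens:
            continue
        p1, p2 = tokens.index(w1), tokens.index(w2)
        order = "w1_first" if p1 < p2 else "w2_first"
        lo, hi = min(p1, p2), max(p1, p2)
        for i, tok in enumerate(tokens):
            if tok not in basis or i in (p1, p2):
                continue
            where = "before" if i < lo else "between" if i < hi else "after"
            counts[(tok, where, order)] += 1
    return [counts[(b, where, order)] for b in basis
            for where in ("before", "between", "after")
            for order in ("w1_first", "w2_first")]

def bagpack_vector(sentences, w1, w2, basis):
    """Concatenate the two single-word sub-vectors and the co-occurrence one."""
    return (single_features(sentences, w1, basis)
            + single_features(sentences, w2, basis)
            + pair_features(sentences, w1, w2, basis))

# Toy corpus and basis terms, invented for this example.
sentences = [["the", "cat", "eats", "the", "fish"],
             ["a", "fish", "is", "in", "the", "sea"]]
basis = ["the", "is"]

co_occurring = bagpack_vector(sentences, "cat", "fish", basis)
never_together = bagpack_vector(sentences, "cat", "sea", basis)
```

In this toy corpus, cat and sea never appear in the same sentence, so the co-occurrence part of `never_together` is all zeros, while the two single-word parts still carry information. That is the fallback behaviour described above.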

The results we obtained for TOEFL and SAT are comparable to those reported by Turney. For the selectional preference task, the correlations between the human judgment scores and our estimates are among the highest reported so far. We also got some encouraging results in the ConceptNet task. For now it should suffice to say that the ConceptNet data set contains approximately 2000 word pairs, each connected by one or more of the 5 relations: IsA, PartOf, CapableOf, UsedFor, LocationOf. When we train a separate binary classifier for each relation, we get AUC scores higher than 95%. An interesting result is given in the following table. We trained our classifiers on the ConceptNet pairs and then evaluated them on the SAT pairs. Here you see the SAT pairs that are classified as positive for each of the 5 relations with the highest posterior probabilities. We have not yet quantified the generalization capacity of this method, but I think the table speaks for itself for the moment.
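For readers unfamiliar with the evaluation setup, the sketch below shows how a multi-label data set can be turned into one binary labelling per relation, and how AUC can be computed from classifier scores. It is only an illustration: the example pairs are invented rather than taken from the ConceptNet data set, the helper names are hypothetical, and the classifier itself (an SVM in our case) is left out.

```python
RELATIONS = ["IsA", "PartOf", "CapableOf", "UsedFor", "LocationOf"]

def binary_labels(pair_relations, relation):
    """Label a pair 1 if it is annotated with `relation`, else 0."""
    return {pair: int(relation in rels) for pair, rels in pair_relations.items()}

def auc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This is the rank-based definition of AUC.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented example pairs; a pair may carry more than one relation.
pair_relations = {
    ("dog", "animal"): {"IsA"},
    ("wheel", "car"): {"PartOf"},
    ("knife", "cut"): {"UsedFor", "CapableOf"},
}

# One binary labelling per relation, each feeding its own classifier.
labels_by_relation = {r: binary_labels(pair_relations, r) for r in RELATIONS}
```

Each relation gets its own labelling of the same pairs, so a pair such as (knife, cut) can be a positive example for two classifiers at once.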

SAT Pairs


This post currently has one response.

  • Peter Turney
    December 21st, 2008 at 3:45 pm

    Interesting work. I look forward to reading your paper.
