Edit October 26, 2014: Update imports for TextBlob>=0.8.0.
In short, WordNet is a database of English words that are linked together by their semantic relationships. It is like a supercharged dictionary/thesaurus with a graph structure.
This tutorial is a gentle introduction to WordNet concepts, using TextBlob for the examples. To follow along with the examples, make sure you have the latest version of TextBlob.
$ pip install -U textblob
As you know, synonyms are words that have similar meanings. A synonym set, or synset, is a group of synonyms. A synset, therefore, corresponds to an abstract concept.
In TextBlob, you can access the synsets that a word belongs to by accessing the
synsets property of a
from textblob import Word word = Word("plant") word.synsets[:5] # [Synset('plant.n.01'), # Synset('plant.n.02'), # Synset('plant.n.03'), # Synset('plant.n.04'), # Synset('plant.v.01')]
It would be helpful to know the definitions of these synsets. You can access these via the
word.definitions[:5] # ['buildings for carrying on industrial labor', # '(botany) a living organism lacking the power of locomotion', # 'an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience', # 'something planted secretly for discovery by another', # 'put or set (seeds, seedlings, or plants) into the ground']
For this tutorial, let's study "plant" as a living organism. We can see from our
definitions list that this is the second synset (index 1).
plant = word.synsets
The synonyms contained within a synset are called lemmas. You can access the string versions of these synonyms via a
plant.lemma_names # ['plant', 'flora', 'plant_life']
The Wordnet Hierarchy
Synsets form relations with other synsets to form a hierarchy of concepts, ranging from very general ("entity", "state") to moderately abstract ("animal") to very specific ("plankton").
Some terminology: For a given synset, its. . .
- hypernyms are the synsets that are more general
- hyponyms are the synsets that are more specific
Hyponyms have an "is-a" relationship to their hypernyms.
Let's use our plant synset as an example.
plant.hypernyms() # [Synset('organism.n.01')] plant.hyponyms()[:3] # [Synset('phytoplankton.n.01'), Synset('aquatic.n.01'), Synset('perennial.n.01')]
We can therefore see the following relationships:
Along with "is-a" relationships, we can explore "is-made-of" and "comprises" relationships.
For a given synset, its. . .
- holonyms are things that the item is contained in
- meronyms are components or substances that make up the item
Let's look at our plant example again.
plant.member_holonyms() # [Synset('plantae.n.01')] plant.part_meronyms() # [Synset('plant_part.n.01'), Synset('hood.n.02')]
This shows the following relationships:
Given that synsets can be organized as a graph, as shown above, we can measure
the similarity of synsets based on the shortest path between them. This is
called the path similarity, and it is equal to
(shortest_path_distance(synset1, synset2) + 1). It ranges from 0.0 (least
similar) to 1.0 (identical).
Let's compare the path similarities between "octopus" and "nautilus" (another cephalapod), "shrimp" (a non-cephalopod), and "pearl" (a mineral). We'll create the synsets directly.
from textblob.wordnet import Synset octopus = Synset("octopus.n.02") nautilus = Synset('paper_nautilus.n.01') shrimp = Synset('shrimp.n.03') pearl = Synset('pearl.n.01')
The results are as expected, with octopus more similar to another cephalopod than a non-cephalapod and most dissimilar to a non-living thing.
octopus.path_similarity(octopus) # 1.0 octopus.path_similarity(nautilus) # 0.33 octopus.path_similarity(shrimp) # 0.11 octopus.path_similarity(pearl) # 0.07
There are other WordNet-based measures of similarity to explore, which can be found at the NLTK Wordnet Docs. All these methods are accessible in TextBlob.