stevenloria.com

Tutorial: What is WordNet? A Conceptual Introduction Using Python

Sep 30, 2013

In short, WordNet is a database of English words that are linked together by their semantic relationships. It is like a supercharged dictionary/thesaurus with a graph structure.

TextBlob 0.7 (changelog) now integrates NLTK's WordNet interface, making it very simple to interact with WordNet.

This tutorial is a gentle introduction to WordNet concepts, using TextBlob for the examples. To follow along with the examples, make sure you have the latest version of TextBlob.

$ pip install -U textblob

Synsets

As you know, synonyms are words that have similar meanings. A synonym set, or synset, is a group of synonyms. A synset, therefore, corresponds to an abstract concept.

In TextBlob, you can access the synsets that a word belongs to by accessing the synsets property of a Word object.

# In TextBlob 0.8 and later the package is named textblob (it was text in 0.7).
from textblob import Word
word = Word("plant")
word.synsets[:5]
# [Synset('plant.n.01'),
#  Synset('plant.n.02'),
#  Synset('plant.n.03'),
#  Synset('plant.n.04'),
#  Synset('plant.v.01')]

It would be helpful to know the definitions of these synsets. You can access these via the definitions property.

word.definitions[:5]
# ['buildings for carrying on industrial labor',
#  '(botany) a living organism lacking the power of locomotion',
#  'an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience',
#  'something planted secretly for discovery by another',
#  'put or set (seeds, seedlings, or plants) into the ground']

For this tutorial, let's study "plant" as a living organism. We can see from our definitions list that this is the second synset (index 1).

plant = word.synsets[1]

The synonyms contained within a synset are called lemmas. You can access the string versions of these synonyms via a Synset's lemma_names method (in NLTK versions before 3.0, lemma_names was a property, accessed without parentheses).

plant.lemma_names()
# ['plant', 'flora', 'plant_life']
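
To make the relationship between words and synsets concrete, here is a minimal sketch in plain Python that does not touch WordNet at all. The `TOY_SYNSETS` table and the `synsets_for` helper are hypothetical names used only for illustration. The `plant.n.02` entry copies the lemma names shown above; the other lemma lists are illustrative assumptions, not data read from WordNet.

```python
# Toy stand-in for WordNet: each synset name maps to its lemma names.
# Only the 'plant.n.02' entry is taken from the output above; the other
# entries are illustrative assumptions, not data read from WordNet.
TOY_SYNSETS = {
    "plant.n.01": ["plant", "works", "industrial_plant"],
    "plant.n.02": ["plant", "flora", "plant_life"],
    "vegetation.n.01": ["vegetation", "flora"],
}


def synsets_for(word):
    """Return the names of every toy synset that lists `word` as a lemma."""
    return [name for name, lemmas in TOY_SYNSETS.items() if word in lemmas]


print(synsets_for("plant"))  # ['plant.n.01', 'plant.n.02']
print(synsets_for("flora"))  # ['plant.n.02', 'vegetation.n.01']
```

The point of the sketch is that the mapping runs both ways: one word can belong to several synsets, and one synset can hold several lemmas.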

The WordNet Hierarchy

Synsets form relations with other synsets to form a hierarchy of concepts, ranging from very general ("entity", "state") to moderately abstract ("animal") to very specific ("plankton").

Some terminology: For a given synset, its...

  • hypernyms are the synsets that are more general
  • hyponyms are the synsets that are more specific

Hyponyms have an "is-a" relationship to their hypernyms.

Let's use our plant synset as an example.

plant.hypernyms()
# [Synset('organism.n.01')]
plant.hyponyms()[:3]
# [Synset('phytoplankton.n.01'), Synset('aquatic.n.01'), Synset('perennial.n.01')]

We can therefore see the following relationships:

[Figure: hypernyms and hyponyms graph]
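
The same "is-a" structure can be sketched with a plain dictionary, without WordNet installed. This is a simplified illustration under stated assumptions: `HYPERNYM_OF`, `hypernym_chain`, and `hyponyms` are hypothetical names, each synset is given a single hypernym (real WordNet synsets can have several), and the link from `organism.n.01` straight to `entity.n.01` skips intermediate levels.

```python
# Toy "is-a" graph: each synset name maps to its direct hypernym.
# The links around 'plant.n.02' come from the output above; the link from
# 'organism.n.01' to 'entity.n.01' is a simplifying assumption.
HYPERNYM_OF = {
    "phytoplankton.n.01": "plant.n.02",
    "aquatic.n.01": "plant.n.02",
    "perennial.n.01": "plant.n.02",
    "plant.n.02": "organism.n.01",
    "organism.n.01": "entity.n.01",
}


def hypernym_chain(name):
    """Follow hypernym links from `name` up to the root of the toy graph."""
    chain = [name]
    while chain[-1] in HYPERNYM_OF:
        chain.append(HYPERNYM_OF[chain[-1]])
    return chain


def hyponyms(name):
    """Hyponyms are the inverse relation: synsets whose hypernym is `name`."""
    return [child for child, parent in HYPERNYM_OF.items() if parent == name]


print(hypernym_chain("perennial.n.01"))
# ['perennial.n.01', 'plant.n.02', 'organism.n.01', 'entity.n.01']
print(hyponyms("plant.n.02"))
# ['phytoplankton.n.01', 'aquatic.n.01', 'perennial.n.01']
```

Reading the chain from left to right gives the "is-a" statements: a perennial is a plant, a plant is an organism, and so on.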

Along with "is-a" relationships, we can explore "is-made-of" and "comprises" relationships.

For a given synset, its...

  • holonyms are things that the item is contained in
  • meronyms are components or substances that make up the item

Let's look at our plant example again.

plant.member_holonyms()
# [Synset('plantae.n.01')]
plant.part_meronyms()
# [Synset('plant_part.n.01'), Synset('hood.n.02')]

This shows the following relationships:

[Figure: holonym and meronym graph]

Semantic similarity

Given that synsets can be organized as a graph, as shown above, we can measure the similarity of synsets based on the shortest path between them. This is called the path similarity, and it is equal to 1 / (shortest_path_distance(synset1, synset2) + 1). It equals 1.0 when the two synsets are identical (distance 0) and approaches 0.0 as the shortest path between them grows longer.
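
As a worked example of the formula, the sketch below computes path similarity on a tiny hand-made taxonomy using breadth-first search. It is only an illustration of the idea: the `EDGES` list and the helper functions are hypothetical, the node names and links are invented, and the resulting numbers will not match WordNet's.

```python
from collections import deque

# Toy taxonomy edges (child, parent). These names and links are invented
# for illustration and do not reproduce WordNet's actual hierarchy.
EDGES = [
    ("octopus", "cephalopod"),
    ("nautilus", "cephalopod"),
    ("cephalopod", "mollusk"),
    ("mollusk", "animal"),
    ("shrimp", "crustacean"),
    ("crustacean", "animal"),
]


def build_neighbors(edges):
    """Treat hypernym/hyponym links as undirected edges."""
    neighbors = {}
    for child, parent in edges:
        neighbors.setdefault(child, set()).add(parent)
        neighbors.setdefault(parent, set()).add(child)
    return neighbors


def shortest_path_distance(neighbors, start, goal):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def path_similarity(neighbors, a, b):
    """1 / (shortest_path_distance + 1), following the definition in the text."""
    dist = shortest_path_distance(neighbors, a, b)
    return None if dist is None else 1 / (dist + 1)


graph = build_neighbors(EDGES)
print(path_similarity(graph, "octopus", "octopus"))   # 1.0
print(path_similarity(graph, "octopus", "nautilus"))  # 0.3333333333333333
print(path_similarity(graph, "octopus", "shrimp"))    # 0.16666666666666666
```

In this toy graph, octopus and nautilus are two edges apart (through cephalopod), giving 1 / 3, while octopus and shrimp are five edges apart (through animal), giving 1 / 6.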

Let's compare the path similarities between "octopus" and "nautilus" (another cephalopod), "shrimp" (a non-cephalopod), and "pearl" (a mineral). We'll create the synsets directly.

# In TextBlob 0.8 and later the module is textblob.wordnet (it was text.wordnet in 0.7).
from textblob.wordnet import Synset
octopus = Synset("octopus.n.02")
nautilus = Synset('paper_nautilus.n.01')
shrimp = Synset('shrimp.n.03')
pearl = Synset('pearl.n.01')

The results are as expected: octopus is more similar to another cephalopod than to a non-cephalopod, and it is least similar to a non-living thing.

octopus.path_similarity(octopus)  # 1.0
octopus.path_similarity(nautilus)  # 0.33
octopus.path_similarity(shrimp)  # 0.11
octopus.path_similarity(pearl)  # 0.07

There are other WordNet-based measures of similarity to explore, which are described in the NLTK WordNet docs. All of these methods are accessible in TextBlob.
