Guillaume Bonfante, Bruno Guillaume, Guy Perrier

Verlag Wiley-ISTE, 2018

ISBN 9781119522348 , 272 Seiten

Format ePUB

Kopierschutz DRM

Geräte

Mehr zum Inhalt

Application of Graph Rewriting to Natural Language Processing

Kapitelkauf
Kurzinformation
Inhaltsverzeichnis
Leseprobe
Blick ins Buch
Fragen zum eBook

1
Programming with Graphs

In this chapter, we shall discuss elements of programming for graphs. Our work is based on PYTHON, a language widely used for natural language processing, as in the case of the NLTK library1 (Natural Language ToolKit), used in our work. However, the elements presented here can easily be translated into another language. Several different data structures may be used to manage graphs. We chose to use dictionaries; this structure is elementary (i.e. unencapsulated), reasonably efficient and extensible. For what follows, we recommend opening an interactive PYTHON session2.

Notes for advanced programmers: by choosing such a primitive structure, we do not have the option to use sophisticated error management mechanisms. There is no domain (or type) verification, no identifier verification, etc. Generally speaking, we shall restrict ourselves to the bare minimum in this area for reasons of time and space. Furthermore, we have chosen not to use encapsulation so that the structure remains as transparent as possible. Defining a class for graphs should make it easier to implement major projects. Readers are encouraged to take a more rigorous approach to that presented here after reading the book. Finally, note that the algorithms used here are not always optimal; once again, our primary aim is to improve readability.

Notes for “beginner” programmers: this book is not intended as an introduction to PYTHON, and we presume that readers have some knowledge of the language, specifically with regard to the use of lists, dictionaries and sets.

The question will be approached from a mathematical perspective in Chapter 7, but for now, we shall simply make use of an intuitive definition of graphs. A graph is a set of nodes connected by labeled edges. The nodes are also labeled (with a phonological form, a feature structure, a logical predicate, etc.). The examples of graphs used in this chapter are dependency structures, which simply connect words in a sentence using syntactic functions. The nodes in these graphs are words (W1, W2, …, W5 in the example below), the edges are links (suj, obj, det) and the labels on the nodes provide the phonology associated with each node. We shall consider the linguistic aspects of dependency structures in Chapter 2.

Note that it is important to distinguish between nodes and their labels. This enables us to differentiate between the two occurrences of "the" in the graph above, corresponding to the two nodes W1 and W4.

In what follows, the nodes in the figures will not be named for ease of reading, but they can be found in the code in the form of strings: 'W1', 'W2', etc.

1.1. Creating a graph

A graph is represented using a dictionary. Let us start with a graph with no nodes or edges.

g = dict ()

The nodes are dictionary keys. The value corresponding to key v is a pair (a, sucs) made up of a label a and of the list sucs of labeled edges starting from v. Let us add a node 'W1' labeled "the" to g:

g[‘W1’] = (‘the’, [])

Now, add a second and a third node, with the edges that connect them:

g[‘W2’] = (‘child’, []) g[‘W3’] = (‘plays’, []) g[‘W3’][1].append ((‘suj’, ‘W2’)) g[‘W2’][1].append ((‘det’, ‘W1’))

and print the result:

g

{‘W1’: (‘the’, []), ‘W2’: (‘child’, [(‘det’, ‘W1’)]), ‘W3’: (‘plays’, [(‘suj’, ‘W2’)])}

The last box shows the output from the PYTHON interpreter. This graph is represented in the following form:

Let us return to the list of successors of a node. This is given in the form of a list of pairs (e, t), indicating the label e of the edge and the identifier t of the target node. In our example, the list of successors of node 'W2' is given by g['W2'][1]. It contains a single pair ('det', 'W1') indicating that the node 'W1' corresponding to "the" is the determiner of 'W2', i.e. the common noun "child".

In practice, it is easier to use construction functions:

def add_node(g, u, a):   #Add a node u labeled a in graph g   g[u] = (a, [])   def add_edge(g, u, e, v):   # Add an edge labeled e from u to v in graph g   if (e, v) not in g[u][1]:     g[u][1].append((e, v))

This may be used as follows:

add_node(g, ‘W4’, ‘the’) add_node(g, ‘W5’, ‘fool’) add_edge(g, ‘W3’, ‘obj’, ‘W5’) add_edge(g, ‘W5’, ‘det’, ‘W4’)

to construct the dependency structure of the sentence "the child plays the fool":

Let us end with the segmentation of a sentence into words. This is represented as a flat graph, connecting words in their order; we add an edge, 'SUC', between each word and its successor. Thus, for the sentence "She takes a glass", we obtain:

The following solution integrates the NLTK segmenter:

import nltk word_list = nltk.word_tokenize(“She takes a glass”) word_graph = dict () for i in range(len(word_list)):     add_node(word_graph, ‘W%s’ % i, word_list[i]) for i in range(len(word_list) - 1):     add_edge(word_graph, ‘W%s’ % i, ‘SUC’, ‘W%s’ % (i + 1)) word_graph

{‘W3’: (‘glass’, []), ‘W1’: (‘takes’, [(‘SUC’, ‘W2’)]), ‘W2’: (‘a’, [(‘SUC’, ‘W3’)]), ‘W0’: (‘She’, [(‘SUC’, ‘W1’)])}

Readers may wish to practice using the two exercises as follows.

EXERCISE 1.1.– Finish constructing the following flat graph so that there is a 'SUC*' edge between each word and one of its distant successors. For example, the chain "She takes a glass" will be transformed as follows:

EXERCISE 1.2.– Write a function to compute a graph in which all of the edges have been reversed. For example, we go from the dependency structure of "the child plays the fool" to:

1.2. Feature structures

So far, node labels have been limited to their phonological form, i.e. a string of characters. Richer forms of structure, namely feature structures, may be required. Once again, we shall use a dictionary:

fs_plays = {‘phon’ : ‘plays’, ‘cat’ : ‘V’}

The fs_plays dictionary designates a feature structure with two features, 'phon' and 'cat', with respective values 'plays' and 'V'. To find the category 'cat' of the feature structure fs_plays, we...