In the last post, we covered graph signal processing and the Fourier transform of graph data. In summary, transforming a 1D signal defined on a graph with \(n\) nodes (denoted as \(\mathbf{g} \in \mathbb{R}^n\)) requires the matrix of eigenvectors of the Laplacian matrix \(L = D - A\). Assuming that we have decomposed the Laplacian as \(L = U\Lambda U^T\), the transformation can be written as:
\[\hat{\mathbf{g}}[l] = {\bf u}_l^{T}{\bf g}\]
where \(\mathbf{u}_l\) is the \(l\)-th column in \(U\).
Filtering operations are also defined in the frequency domain. Following the notation of (Defferrard et al., 2016), the convolution of two signals \(x\) and \(y\) on a graph \(G\) is defined as:
\[x *_{G}y = U((U^Tx)\odot(U^Ty)),\]
where the formula transforms both \(x\) and \(y\) to the frequency domain (\(U^Tx\) and \(U^Ty\)), performs the convolution there as an element-wise product (\((U^Tx)\odot(U^Ty)\)), and returns to the original domain (\(U((U^Tx)\odot(U^Ty))\)). Recall from the convolution theorem that element-wise multiplication in the frequency domain is equivalent to convolution in the original domain. In the notation of the previous post, \(U^Ty\) plays the role of the filtering operation \(h(\Lambda)\), and \(x\) that of \(\mathbf{g}\) (or vice versa).
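As a quick sanity check of this pipeline, here is a minimal NumPy sketch on a toy 4-node path graph (the graph and the two random signals are arbitrary placeholders, not data from any experiment below):

```python
import numpy as np

# Toy 4-node path graph (arbitrary example).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A                      # combinatorial Laplacian

# Eigendecomposition L = U Lambda U^T (eigh, since L is symmetric).
lam, U = np.linalg.eigh(L)

x = np.random.randn(4)         # a signal on the graph
y = np.random.randn(4)         # another signal acting as a filter

x_hat = U.T @ x                # forward graph Fourier transform
x_back = U @ x_hat             # inverse transform recovers x
assert np.allclose(x_back, x)

# Spectral convolution: multiply element-wise in the frequency domain,
# then transform back to the vertex domain.
conv = U @ ((U.T @ x) * (U.T @ y))
```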
However, the most widely used form of graph convolution (the GCN model of Kipf & Welling) is still far from the above formulation. Between the classical graph Fourier transform and GCN lies an intermediate work, ChebConv (Defferrard et al., 2016), which reduces the complexity of the spectral formulation and proposes a learnable graph model that later inspired GCN. We will try to understand the ideas behind (Defferrard et al., 2016) and perform some analysis using an experiment proposed by (Alon & Yahav, 2021).
ChebConv
Firstly, let us assume that we are interested in learning a filter (or graph neural network) on a given graph \(G\). Based on the description above, the most straightforward approach is to make the filter itself learnable: set \(h(\Lambda) = \mathrm{diag}(\theta)\) with a learnable vector \(\theta \in \mathbb{R}^n\). This introduces a total of \(n\) parameters (this is the non-parametric filter discussed in the ChebConv paper).
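A minimal PyTorch sketch of such a non-parametric filter might look as follows; the module name and interface are hypothetical and only meant to make the parameter count of \(n\) explicit:

```python
import torch
import torch.nn as nn

class NonParametricSpectralFilter(nn.Module):
    """One learnable coefficient per eigenvalue: h(Lambda) = diag(theta).

    `U` is the (fixed) eigenvector matrix of the graph Laplacian. This is a
    hypothetical module written only to illustrate the n-parameter filter.
    """
    def __init__(self, U: torch.Tensor):
        super().__init__()
        self.register_buffer("U", U)
        self.theta = nn.Parameter(torch.ones(U.shape[0]))   # n parameters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # U diag(theta) U^T x: filter applied in the spectral domain.
        return self.U @ (self.theta * (self.U.t() @ x))
```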
Two modifications from non-parametric filters
The first modification of ChebConv is to reduce the number of parameters from \(n\) to \(K\) (with \(K < n\)) by setting \(h(\Lambda)\) to be a polynomial filter. That is, the filter \(h(\Lambda)\) is now expressed in a polynomial basis (literally powers of \(\Lambda\): \(I\), \(\Lambda\), \(\Lambda^2\), \(\cdots\)) with coefficients \(\theta_k\):
\[h(\Lambda) = \sum_{k=0}^{K-1}\theta_k \Lambda^k.\]
Applying this filter to a graph signal \(x\):
\[U h(\Lambda) U^T x = U \left(\sum_{k=0}^{K-1}\theta_k \Lambda^k\right) U^T x = \sum_{k=0}^{K-1}\theta_k (U\Lambda^kU^T)x = \sum_{k=0}^{K-1}\theta_k L^kx\]
(Note that \(UU^T = I\), so \(U\Lambda^k U^T = (U\Lambda U^T)^k = L^k\).)
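The identity above is easy to verify numerically. Below is a small NumPy sketch (graph, signal, and coefficients are arbitrary) that applies the polynomial filter purely in the vertex domain and checks it against the spectral form:

```python
import numpy as np

def poly_filter(L: np.ndarray, x: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Apply sum_k theta_k L^k x without any eigendecomposition."""
    out = np.zeros_like(x)
    Lkx = x.copy()                     # L^0 x = x
    for t in theta:
        out += t * Lkx
        Lkx = L @ Lkx                  # next power of L applied to x
    return out

# Toy check that the vertex-domain form matches the spectral form.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(1)) - A
x = np.random.randn(4)
theta = np.array([0.5, -1.0, 0.25])    # K = 3 arbitrary coefficients

lam, U = np.linalg.eigh(L)
h_lam = sum(t * lam**k for k, t in enumerate(theta))
assert np.allclose(poly_filter(L, x, theta), U @ (h_lam * (U.T @ x)))
```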
The resulting filter is \(K\)-localized (Defferrard et al., 2016). Intuitively, with the Laplacian \(L = D-A\), the highest power \(L^{K-1} = (D-A)^{K-1}\) contains terms from \(I\) up to \(A^{K-1}\), so the filtered signal \(\sum_{k}\theta_k L^k x\) at each node only mixes information from nodes up to \((K-1)\) hops away.
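The locality claim can also be checked directly: on a toy path graph, the entry \((L^{K-1})_{ij}\) vanishes whenever nodes \(i\) and \(j\) are more than \(K-1\) hops apart. A short sketch (assuming networkx is available):

```python
import numpy as np
import networkx as nx

# Locality check on a 6-node path graph: (L^{K-1})[i, j] can only be
# non-zero when nodes i and j are at most K-1 hops apart.
G = nx.path_graph(6)
A = nx.to_numpy_array(G)
L = np.diag(A.sum(1)) - A

K = 3
Lk = np.linalg.matrix_power(L, K - 1)          # L^{K-1}, here L^2
dist = dict(nx.all_pairs_shortest_path_length(G))
for i in range(6):
    for j in range(6):
        if dist[i][j] > K - 1:
            assert np.isclose(Lk[i, j], 0.0)   # no interaction beyond K-1 hops
```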
The second modification is to use Chebyshev polynomials instead of the monomial basis above. This has several advantages, such as the use of an orthogonal basis and, more importantly, an increase in efficiency, since the basis can be defined recursively:
\[T_0(\lambda) = 1, \quad T_1(\lambda) = \lambda, \quad T_k(\lambda) = 2\lambda T_{k-1}(\lambda) - T_{k-2}(\lambda),\]
so the filter becomes \(h(\Lambda) = \sum_{k=0}^{K-1}\theta_k T_k(\Lambda)\).
(ChebConv actually replaces \(\Lambda\) with a rescaled version \(\tilde{\Lambda} = 2\Lambda/\lambda_{\max} - I\) so that the eigenvalues lie in \([-1, 1]\), but we will not dwell on this here.)
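To make the recursion concrete, here is a minimal NumPy sketch that evaluates the filtered signal \(\sum_k \theta_k T_k(\tilde{L})\,x\) directly in the vertex domain, without ever forming the eigendecomposition. Unlike the prose above, it does include the rescaling so that the Chebyshev recursion is well behaved:

```python
import numpy as np

def cheb_filter(L: np.ndarray, x: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """sum_k theta_k T_k(L_tilde) x via the Chebyshev recursion.

    L_tilde = 2 L / lambda_max - I rescales the spectrum into [-1, 1],
    which is where the Chebyshev polynomials are defined.
    """
    n = L.shape[0]
    lam_max = np.linalg.eigvalsh(L).max()
    L_tilde = 2.0 * L / lam_max - np.eye(n)

    Tx_prev = x                          # T_0(L_tilde) x = x
    Tx_curr = L_tilde @ x                # T_1(L_tilde) x = L_tilde x
    out = theta[0] * Tx_prev
    if len(theta) > 1:
        out = out + theta[1] * Tx_curr
    for k in range(2, len(theta)):
        # T_k = 2 L_tilde T_{k-1} - T_{k-2}
        Tx_next = 2.0 * (L_tilde @ Tx_curr) - Tx_prev
        out = out + theta[k] * Tx_next
        Tx_prev, Tx_curr = Tx_curr, Tx_next
    return out
```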
Implications of the ChebConv formulation
As mentioned above, the main implication of this version of graph filtering is that the convolution now only affects a local neighborhood of each node, determined by the number of basis elements used. We will observe this effect directly using the NEIGHBORSMATCH problem (Alon & Yahav, 2021).
Observing the locality effect
The NEIGHBORSMATCH problem
Proposed by (Alon & Yahav, 2021), the NEIGHBORSMATCH problem is a task on synthetically generated graphs. The intention is to create a task with a specific problem radius, i.e., the range of interaction between a node and its neighbors required to solve the problem. A number of graphs with problem radius \(r\) are generated, and a graph neural network (GNN) is trained to solve the task. If the GNN has fewer than \(r\) layers, it cannot solve the problem at all.
Our modification
In our case, we slightly modify the NEIGHBORSMATCH problem into a simpler configuration. Consider the synthetically generated graph below:
Figure 1: Visualization of a single graph with depth 4.
[Appendix 1: Source code for the simplified NEIGHBORSMATCH] [Appendix 2: Code for generation] [Appendix 3: Code for visualizing]
Figure 1 visualizes a single graph within the dataset generated from the code. These are some features of the graph and the dataset:
The graph is a full binary tree, with the root node colored red in the figure.
Each node is assigned a random integer, which serves as its node feature (in the implementation, the integer is encoded as a one-hot vector).
The leaf nodes are assigned an integer that is different from those of the rest of the nodes in the graph.
This leaf integer is defined as the class label of the root node.
Consequently, the integer of the root node itself has no relationship with the actual class label.
Similar to the original version, given a dataset of such graphs, we are interested in classifying the root node ‘directly’ using a GNN model; that is, the GNN computes the class probability of the root by aggregating neighbor information with respect to the root node itself.
A direct consequence is that the GNN cannot solve this NEIGHBORSMATCH variant unless information from the leaf nodes has propagated all the way to the root node.
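For concreteness, a hedged sketch of how one such sample could be generated is shown below (the actual generator is in Appendix 1; the function name, the use of networkx/PyTorch Geometric, and the exact encoding are my own assumptions):

```python
import torch
import networkx as nx
from torch_geometric.data import Data
from torch_geometric.utils import from_networkx

def make_tree_sample(depth: int, num_classes: int) -> Data:
    """Hypothetical generator for one simplified-NEIGHBORSMATCH sample."""
    g = nx.balanced_tree(r=2, h=depth)          # full binary tree, root = node 0
    leaves = [v for v in g.nodes if g.degree[v] == 1 and v != 0]

    # Internal nodes get random integers from one range, leaves from another,
    # so the label-carrying information only exists at the leaves.
    leaf_value = int(torch.randint(num_classes, (1,)))
    values = {v: int(torch.randint(num_classes, num_classes * 2, (1,)))
              for v in g.nodes}
    for v in leaves:
        values[v] = leaf_value

    data = from_networkx(g)
    feats = torch.tensor([values[v] for v in g.nodes])
    data.x = torch.nn.functional.one_hot(feats, num_classes * 2).float()
    data.y = torch.tensor([leaf_value])         # label of the root node
    return data
```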
Directly observing the effect of \(K\)
Here, we aim to observe the receptive field of ChebConv using the modified NEIGHBORSMATCH problem above. In the following experiment, we build a GNN model using exactly one ChebConv layer with a pre-defined value of \(K\), and observe whether it can solve the problem for datasets of different depths.
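A sketch of what such a single-layer model looks like with PyTorch Geometric's ChebConv is given below; the actual model code is in Appendix 4, so treat this only as an illustration of the setup:

```python
import torch
from torch_geometric.nn import ChebConv

class OneLayerChebNet(torch.nn.Module):
    """Single ChebConv layer with a given K and no non-linearity."""
    def __init__(self, num_features: int, num_classes: int, K: int):
        super().__init__()
        self.conv = ChebConv(num_features, num_classes, K=K)

    def forward(self, x, edge_index):
        # Logits for every node; the loss is read off the root node's row.
        return self.conv(x, edge_index)
```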
Figure 2: Performance heatmap with different depth and \(K\) values.
[Appendix 4: Model code] [Appendix 5: Experiment code]
We see in Figure 2 that, for each depth of the NEIGHBORSMATCH dataset, ChebConv requires a filter localized widely enough to reach the leaf nodes. This is expected, since by design the information crucial for solving the problem exists only in the feature vectors of the leaf nodes.
Also, since the problem itself is very straightforward, the performance of the models that can solve it is always near 1, even though we only used a single ChebConv layer without any non-linear activation functions.
For the models that could not solve the problem, the performance degrades to that of a random classifier: it is always near \(1/(\text{num. of classes})\).
Chance of redemption: Stacking layers
For ChebConv layers that do not have a large enough receptive field, we can still make the model solve the problem with a simple modification: stacking layers. Since each convolution aggregates information from the neighbors within its receptive field, stacking multiple convolutions lets information travel further. This design choice is also highlighted in the next paper that we will cover (GCN).
To see this effect, we build several models from ChebConv layers, each with a receptive field of \(K=2\). The task is generated with depth values varying from 1 to 6. In the first experiment (where we restricted ourselves to a one-layer model), such a model is bound to fail for depths greater than 1. In this experiment, we stack layers for each depth value until the model can finally solve the problem.
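A sketch of the stacked model, again using PyTorch Geometric's ChebConv, could look like the following (the real model code is in Appendix 6; the hidden width and the absence of non-linearities here are my own simplifications):

```python
import torch
from torch_geometric.nn import ChebConv

class StackedChebNet(torch.nn.Module):
    """Stack of ChebConv layers; each layer with K=2 adds one hop of reach."""
    def __init__(self, num_features: int, hidden: int, num_classes: int,
                 num_layers: int, K: int = 2):
        super().__init__()
        dims = [num_features] + [hidden] * (num_layers - 1) + [num_classes]
        self.convs = torch.nn.ModuleList(
            ChebConv(dims[i], dims[i + 1], K=K) for i in range(num_layers))

    def forward(self, x, edge_index):
        for conv in self.convs:
            x = conv(x, edge_index)      # no non-linearity in this sketch
        return x
```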
Figure 3: Performance heatmap with different depths and numbers of layers (\(K=2\)).
[Appendix 6: Model code] [Appendix 7: Experiment log]
Here, we can see that by stacking enough layers, the model eventually becomes able to solve the task. Also notice that in the cases where the model fails, the performance is roughly \(1/(\text{num. of classes})\), indicating that the model is essentially a random classifier. Let’s look at another case, where we stack ChebConv layers with \(K=3\):
Figure 4: Performance heatmap with different depths and numbers of layers (\(K=3\)).
[Appendix 8: Experiment log]
As expected, the depth that the model can solve now increases twice as fast per layer compared to the case of \(K=2\), since each \(K=3\) layer covers two hops instead of one. Note that there is no benefit to stacking layers with \(K=1\) for solving NEIGHBORSMATCH: such a filter only uses the node itself (recall that the first Chebyshev polynomial is the identity matrix \(I\)) and never takes neighboring nodes into consideration.
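As a quick check of this last claim, plugging \(K=1\) into the Chebyshev filter leaves only the zeroth-order term:

\[\sum_{k=0}^{0}\theta_k T_k(\tilde{\Lambda}) = \theta_0 T_0(\tilde{\Lambda}) = \theta_0 I \quad\Longrightarrow\quad U(\theta_0 I)U^Tx = \theta_0 x,\]

so every node's output is just a rescaled copy of its own feature, no matter how many such layers we stack.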
Conclusions
We have reviewed the main idea of ChebConv, along with its interpretation in terms of the receptive field. Running various experiments on the NEIGHBORSMATCH problem, we were able to directly observe the effect of \(K\) on the solvability of the task, which by design can only be solved by a model whose receptive field covers the task’s problem radius.
Appendix
Appendix 1
Appendix 2
Appendix 3
Appendix 4
Appendix 5
Appendix 6
Appendix 7
Appendix 8
References
Alon, U., & Yahav, E. (2021). On the Bottleneck of Graph Neural Networks and its Practical Implications. In International Conference on Learning Representations (ICLR).