Haitao Mao

Research Summary

2024-11-08T00:00:00+00:00

What is the role of graphs in the Age of Foundation Models

Hello, this is Haitao, a final-year Ph.D. candidate at Michigan State University. I am actively seeking an industry position starting around May 2025. My research interests include Graph Machine Learning, Recommender Systems, Large Language Models, and Information Retrieval. More details can be found in my Resume, Research Statement, and Talk Slides. If you know of any relevant openings, I would greatly appreciate your consideration. Thank you! This article highlights my research contributions and offers an overview of my work.

Graphs describe complicated relationships between different instances, revealing how data instances are interconnected and uncovering collective patterns that are difficult to express with a single element. I have worked with many real-world uses of graphs, including relationships between cells in tabular data, primary-foreign key connections in relational databases, and how user behaviors are influenced by search engine result pages. However, due to the distinct graph characteristics in various domains, most existing solutions require careful domain-specific architecture designs and are trained from scratch for each application. This makes graph modeling an expertise-intensive process and hinders generalization across graphs from different domains with varying properties.

The era of Foundation Models (FMs) brings versatile model capabilities that reduce the requirement for training from scratch. FMs aim to recognize foundational patterns, enabling models to transfer knowledge across data from different domains and adapt to a wide range of tasks. In this context, two questions arise at the heart of my research: (1) Can graphs have similarly versatile models that require less expert design? (2) Can graphs enhance the utilization of existing foundation models in other modalities?

Revolving around these questions, my research is threefold:

Academic focus: Towards Versatile Graph Models. Versatile Graph Models offer a general foundation, encapsulate complicated graph modeling details, and enable easy adaptation to benefit various downstream graph applications. A detailed research statement can be found here.
Industry experience: Resource-efficient practical solutions via identifying and leveraging underlying relationships among data. I leverage underlying relationships between instances to enhance machine learning efficiency through better data utilization and domain gap mitigation.
Recent curiosity: LLM mechanism analysis to enhance output reliability. Despite the impressive capabilities of LLMs, their responses can be superficial. My research focuses on understanding this limitation and developing methods to achieve more reliable LLM outputs.

Academic focus: Towards Versatile Graph Models

Outline

The failure of GNNs under distribution shift
LLMs cannot solve graph tasks on their own because they are unable to capture structural patterns
Develop versatile graph model via capturing essential network patterns

3.1 Essential structure patterns and underlying relationships

3.2 Versatile structural graph model from a generative perspective

3.3 When additional features are taken into consideration

3.4 Essential Feature patterns and underlying relationships

3.5 Feature-centric versatile graph model
Align versatile graph models with LLMs for enhanced understanding of different modalities

1. The Failure of GNNs under distribution shift

Problem statement: Tabular metadata tagging in Excel
- Metadata categorizes the role of each cell. A measure contains numerical data suitable for calculations like sum, count. A dimension contains categorical information that can be utilized for filtering and grouping.
- Tagging tabular metadata is a crucial preprocessing step for accurately operating on table fields, supporting advanced data analysis features in Microsoft Excel.
- Why GNN? Beyond the semantic meaning in each individual cell, there also exist underlying hierarchical and paratactic relationships between cells. GNNs are introduced to jointly capture both the semantic content in each cell and the relationships among them.
Challenge 1: Structural distribution shift. Given the flexibility and broad applicability of tabular data, their underlying graph structures can vary substantially, resulting in significant distribution shifts and performance degradation when GNNs are applied to new domains.
Challenge 2: Data privacy. Tabular data from different organizations cannot be shared due to the data privacy policy.
New Scenario Definition: Source-free Graph Domain Adaptation. We adapt the source graph model to unlabeled target graphs without requiring access to the labeled source graph.
Algorithm Solution: leveraging non-i.i.d. relationships between nodes in the target graph to adapt the initial discriminative ability from the source domain. Two structural proximity objectives are proposed to enhance prediction consistency.
Usage: helps annotate (tag) tabular data.
Conclusion: GNNs are not robust across graphs with structural shifts.

The source training procedure and labeled source graph in the shadow box are not accessible. Our algorithm only includes the left dashed box describing the adaptation procedure. SOGA utilizes the output of the model on the unlabeled target graph to optimize two objectives: Information Maximization and Structure Consistency to adapt the model on the target domain.
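The Information Maximization objective mentioned above can be sketched in a few lines. This is a minimal illustration of the standard information-maximization loss (low per-node entropy, high entropy of the average prediction), assuming the model outputs softmax probabilities on the target nodes; the function name and toy inputs are hypothetical, and the Structure Consistency objective is not covered here.

```python
import numpy as np

def information_maximization_loss(probs, eps=1e-12):
    """probs: array of shape (n_nodes, n_classes), softmax outputs on the target graph."""
    probs = np.asarray(probs, dtype=float)
    # Conditional entropy: mean entropy of each node's prediction (lower = more confident).
    conditional_entropy = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    # Marginal entropy: entropy of the average prediction (higher = more balanced classes).
    marginal = probs.mean(axis=0)
    marginal_entropy = -np.sum(marginal * np.log(marginal + eps))
    # Minimizing this value maximizes the mutual-information estimate.
    return conditional_entropy - marginal_entropy

# Hypothetical toy predictions for two target nodes and two classes.
confident_balanced = np.array([[0.99, 0.01], [0.01, 0.99]])
uncertain = np.array([[0.5, 0.5], [0.5, 0.5]])
collapsed = np.array([[0.99, 0.01], [0.99, 0.01]])

for name, p in [("confident_balanced", confident_balanced),
                ("uncertain", uncertain),
                ("collapsed", collapsed)]:
    print(name, round(information_maximization_loss(p), 3))
```

Confident and class-balanced predictions give the lowest loss, while both uncertain predictions and predictions collapsed onto one class are penalized.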

My further research reveals that most GNNs struggle to capture (i) both homophily and heterophily patterns simultaneously and (ii) long-range dependencies, such as PPR and Katz patterns. These observations motivate me to explore alternative solutions that could replace GNNs without the above limitations. My first attempt was an efficient MLP with graph regularization, but it did not succeed despite multiple attempts.

[1] Source Free Graph Unsupervised Domain Adaptation, Haitao Mao et al. WSDM 2024 Best Paper Honor Mention

2. LLMs cannot be directly applied to solve complicated graph tasks

With the emergence of ChatGPT’s powerful yet largely unexplored capabilities (at that time), I started to wonder whether LLMs could effectively understand graph structures and potentially surpass GNNs.

Conclusion: while LLMs struggle to capture structural patterns effectively, they still offer valuable insights by understanding textual node features, thereby enhancing graph learning.

Detailed graph-LLM pipeline designs:

LLM-as-Enhancer (Embedding Textual Node Attributes)
- provides a high-quality feature space that improves GNN performance.
LLM-as-predictor (In-context Learning)
- transform both node attributes and structure information into text description
- Satisfactory zero-shot performance on node attributes alone.
- Adding structure may even degrade the performance
LLM-as-annotator (a practical solution)
- LLMs can achieve satisfying zero-shot performance but incur high inference costs
- Select representative nodes for the LLM to annotate
- Utilize LLM predictions as pseudo-labels to train downstream GNNs.

Illustration of LLMs for Graph ML.

[1] Exploring the potential of large language models (llms) in learning on graphs, Zhikai Chen, Haitao Mao et al., KDD Exploration 2024

[2] Label-free Node Classification on Graphs with Large Language Models (LLMs), Zhikai Chen, Haitao Mao et al., ICLR 2024

[3] Text-space Graph Foundation Models: A Comprehensive Benchmark and New Insights, Zhikai Chen, Haitao Mao et al., NeurIPS 2024 DB track

3. Develop versatile graph models via capturing essential network patterns

Since the success of LLMs cannot be directly applied to the graph domain, I pioneer the development of versatile Graph Models trained from scratch, capable of generalizing to new instances under distribution shift. To achieve this, (1) I endeavor to elucidate fundamental network patterns prevalent across diverse graphs and investigate their relationships through network science approaches, thereby achieving transferability across graphs. (2) I strive to design more versatile graph models that address the inability of current GNNs to capture these fundamental network patterns. The technical keys are threefold: (1) collecting graph data with diverse patterns, (2) designing a flexible and expressive model architecture, and (3) establishing a suitable training objective.

3.1 Essential structure patterns and underlying relationships

Fundamental structural patterns
- Local Structural Proximity (LSP) corresponds to the similarity of immediate neighborhoods between two nodes.
- Global Structural Proximity (GSP) accounts for the ensemble of all paths between two nodes.
Mathematical details: Design a latent space network model capturing both proximities. Proximity is described with distances in the latent space.
Underlying relationships
- LSP proves to be more effective than GSP.
- GSP remains valuable when LSP information is absent.
The failure of GNNs
- GNNs can capture LSP well while they struggle on graphs with GSP patterns
- the traditional Katz heuristic surpasses GNNs by 10% in a power network.

Local Structural Proximity (LSP) and Global Structural Proximity (GSP) are two fundamental structural patterns. LSP is more effective, while GSP plays a role when LSP is absent.
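To make the two proximities concrete, here is a minimal sketch using two classical heuristics as stand-ins: common neighbors for LSP and the Katz index for GSP. The toy chain graph, the function names, and the decay value `beta=0.1` are assumptions for illustration, not the setup used in the paper.

```python
import numpy as np

def common_neighbors(adj):
    """LSP heuristic: off-diagonal entry (i, j) counts the neighbors shared by i and j."""
    return adj @ adj

def katz_index(adj, beta=0.1):
    """GSP heuristic: sum over walk lengths l >= 1 of beta**l * (number of walks of length l)."""
    spectral_radius = np.max(np.abs(np.linalg.eigvals(adj)))
    if beta * spectral_radius >= 1:
        raise ValueError("beta must be smaller than 1 / spectral radius for the series to converge")
    identity = np.eye(adj.shape[0])
    # Closed form of the geometric series: (I - beta * A)^-1 - I.
    return np.linalg.inv(identity - beta * adj) - identity

# Toy chain 0 - 1 - 2 - 3 (a sparse, grid-like structure).
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)

cn = common_neighbors(adj)
katz = katz_index(adj)
print("common neighbors (0, 3):", cn[0, 3])
print("Katz score       (0, 3):", katz[0, 3])
```

Nodes 0 and 3 share no neighbors, so the local heuristic gives no signal for that pair, while the Katz score is still positive because it accumulates longer paths.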

[1] Revisiting Link Prediction: A Data Perspective, Haitao Mao et al., ICLR 2024

3.2 Versatile structural graph model from a generative perspective

Why do GNNs fail?
- Data: Benchmark datasets only emphasize LSP, while GSP is largely ignored.
- Model: GNN architecture designs also focus on LSP.
- Objective: The BPR loss requires careful design of negative samples, and improper negative samples make the model unable to capture GSP patterns.
Solution:
- Data: Pre-train over 100+ datasets across 33 domains, extending diverse datasets beyond commonly used graphs with limited patterns.
- Model: A graph diffusion model guided by multiple graph proximities, e.g., degree for LSP and network entropy for GSP, as guidance inputs.
- Objective: Generative objective inspired by network modeling literature
Model usage
- Directly applicable learned representations for downstream tasks.
- Generate effective data augmentation, improving node classification, link prediction, and graph regression performance.
- Generates synthetic graphs with varied properties, allowing comprehensive GNN performance evaluation.

GNNs fail to capture GSP patterns because a discriminative approach easily over-relies on one pattern. The structural GFM adopts a generative approach that achieves more comprehensive graph modeling.

[1] Cross-Domain Graph Data Scaling: A Showcase with Diffusion Models, Haitao Mao et al., 2024

3.3 When additional features are taken into consideration

Beyond essential structural patterns common to all graphs, many graphs offer high-quality node features contributing additional knowledge.

Feature proximity: nodes with similar attributes tend to be connected.
Relationship with structure patterns: FP and LSP work on different instances, offering complementary effects.
Inspiration: leveraging feature patterns to enhance graph modeling.

[1] Revisiting Link Prediction: A Data Perspective, Haitao Mao et al., ICLR 2024

3.4. Essential Feature patterns and underlying relationships

Fundamental feature patterns: Homophily and heterophily patterns describe the tendencies of neighboring nodes to exhibit similar or dissimilar features.
Phenomenon: Real-world networks often exhibit both homophily and heterophily patterns
Mathematical details: A variant of the CSBM model captures both patterns by assigning edges that reflect intra-group similarity and inter-group diversity.
Underlying relationships through mean aggregation
- For homophily patterns, node features remain unchanged (i.e., red features stay red).
- For heterophily patterns, node features flip (i.e., red and green features switch).
- The symmetries on homophily and heterophily patterns differ through mean aggregation
The failure of GNNs: A GNN can work well on either homophily or heterophily patterns, but underperforms on the other.

Homophily and heterophily are two fundamental feature patterns. GNNs often excel at one but struggle with the other. The challenge stems from different feature symmetries on different patterns: after message passing, homophily node features stay consistent while heterophily node features flip, which makes it challenging for GNNs to handle both effectively.
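The "stay versus flip" behaviour can be checked on a toy example. The sketch below assumes a one-dimensional feature where red is +1 and green is -1, and uses plain mean aggregation over neighbors without self-loops; the three small graphs are hypothetical.

```python
import numpy as np

def mean_aggregate(adj, x):
    """Average the neighbor features of every node (assumes no isolated nodes)."""
    degree = adj.sum(axis=1, keepdims=True)
    return (adj @ x) / degree

# Nodes 0, 1 are "red" (+1); nodes 2, 3 are "green" (-1).
x = np.array([[1.0], [1.0], [-1.0], [-1.0]])

# Homophilous graph: edges only within a color (0-1 and 2-3).
homophilous = np.array([[0, 1, 0, 0],
                        [1, 0, 0, 0],
                        [0, 0, 0, 1],
                        [0, 0, 1, 0]], dtype=float)

# Heterophilous graph: edges only across colors (complete bipartite).
heterophilous = np.array([[0, 0, 1, 1],
                          [0, 0, 1, 1],
                          [1, 1, 0, 0],
                          [1, 1, 0, 0]], dtype=float)

# Mixed graph: every node has one same-color and one other-color neighbor.
mixed = np.array([[0, 1, 1, 0],
                  [1, 0, 0, 1],
                  [1, 0, 0, 1],
                  [0, 1, 1, 0]], dtype=float)

print("homophilous  :", mean_aggregate(homophilous, x).ravel())
print("heterophilous:", mean_aggregate(heterophilous, x).ravel())
print("mixed        :", mean_aggregate(mixed, x).ravel())
```

In this toy setting the homophilous graph keeps every feature, the heterophilous graph flips every feature, and the mixed graph averages the two colors away, which hints at why one fixed aggregation rule has trouble serving both patterns at once.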

[1] Demystifying Structural Disparity in Graph Neural Networks: Can One Size Fit All?, Haitao Mao et al., NeurIPS 2023

3.5. Feature-centric versatile graph model

Why do GNNs fail? GNNs with fixed aggregation are constrained in their ability to handle both homophily and heterophily patterns effectively.
Solution:
- Model: Transformer with adaptive neighbor selection
- Data: Improve feature quality which reduces the previous reliance on structure patterns
- Train: large-scale pre-training over 100 million sequences generated from random walk.
Results:
- Selective Attention Mechanism with high weights on useful neighbor nodes
- Satisfactory few-shot performance comparable with fully supervised learning

The feature-centric GFM employs transformers with a feature-driven approach for selecting relevant neighborhoods. To enhance neighbor selection, I convert node features into textual representations, enabling high-quality embeddings. Masked node modeling is then applied over 100 million instances, ensuring comprehensive feature learning.
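As a rough picture of how random-walk sequences and masked node modeling fit together, here is a minimal sketch. The uniform walk, the single masked position, the `[MASK]` token, and the toy adjacency lists are all assumptions for illustration; the actual sampling and masking strategy of the framework may differ.

```python
import random

def random_walk(neighbors, start, length, rng):
    """Sample a uniform random walk of at most `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        candidates = neighbors[walk[-1]]
        if not candidates:  # dead end: stop early
            break
        walk.append(rng.choice(candidates))
    return walk

def mask_one_node(walk, rng, mask_token="[MASK]"):
    """Hide one position of the walk; the model would be trained to recover it."""
    position = rng.randrange(len(walk))
    masked = list(walk)
    masked[position] = mask_token
    return masked, position, walk[position]

# Hypothetical toy graph given as adjacency lists.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
rng = random.Random(0)

walk = random_walk(neighbors, start=0, length=5, rng=rng)
masked, position, target = mask_one_node(walk, rng)
print("walk  :", walk)
print("masked:", masked, "-> predict node", target, "at position", position)
```

Each walk turns a neighborhood into a sequence that a transformer can consume, and the masked position supplies the self-supervised target.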

[1] A Pure Transformer Pretraining Framework on Text-attributed Graphs, Yu Song, Haitao Mao et al., 2024

3.6. Align versatile graph models with LLMs for enhanced understanding of different modalities

Many real-world tasks require comprehensive understanding on different modalities, necessitating the integration of graph knowledge with foundation models from other data modalities. Real-world scenarios I’ve encountered include:

Conversational Recommendation requires understanding both textual user queries and prior interactions between users and items, often modeled as a user-item bipartite graph.
Code Generation & Program Optimization requires an integrated understanding of code text, environment configurations and the underlying program logic represented as graphs, e.g., control flow graph, data flow graph.

Challenges on alignment

No natural alignment. Unlike images, which can often be described easily with text, graphs lack straightforward textual descriptions due to their inherent flexibility and ambiguity. Consequently, the success of CLIP, which aligns image and text data effectively, does not readily extend to the graph domain.
Massive graph vocabulary. Taking recommendation scenarios as an example, each user and item has a unique identifier, leading to a vocabulary size in the billions. Aligning these identifiers with LLMs is challenging, as it is nearly impossible to effectively incorporate them into the LLM’s vocabulary.

To address these issues, I am currently working on quantizing graph knowledge into semantic identifiers, which could help bridge these gaps. I look forward to sharing more on this approach after the paper comes out.

Industry experience: Resource-efficient practical solutions via identifying and leveraging underlying relationships among data

My research maintains industry connections through internships at Microsoft, Snap, and Baidu, along with collaborations with Amazon, Google, Intel, and JP Morgan. My industry mentors have shifted my perspective from a technique-driven to a problem-driven approach, highlighting real-world resource-efficiency challenges, including limited data availability, parameter budgets, and computational resources. My internship projects focus on identifying and leveraging underlying relationships to solve practical problems efficiently. Roughly speaking, the underlying relationships (graph structure) can help with:

Better Performance when labels vary smoothly over the high-quality underlying graph structure
Better Data Utilization: Enabling semi-supervised learning and missing value imputation via capturing the relationship between instances.
Domain Gap Mitigation via utilizing structural dependency between different domains.

Outline

Search: Text-rich Unbiased Learning to Rank in Baidu
Recommendation

2.1. Multiple-locale Multilingual Session Recommendation in Amazon

2.2. Cross-domain Sequential Recommendation with Generative Retrieval
Tabular Metadata Tagging in Microsoft
Accelerate and Stabilize Neural Network training through insights on inter-neuron relationship

1. Search: Text-rich Unbiased Learning to Rank in Baidu

Problem statement: Unbiased Learning to Rank (ULTR)
- Cost-effective click data: User click behavior data serves as a cost-effective alternative to expert annotation, providing a scalable approach to collect updated trends without requiring labor-intensive annotation processes.
- Biases in click data: User click behavior is inherently biased by the search engine result page (SERP) presentation. For example, users favor clicking documents displayed in higher-ranked positions.
- The purpose of ULTR algorithm: ULTR algorithms aim to mitigate these biases, enabling the use of implicit user feedback for more accurate and unbiased ranking outcomes.
Issue: Out-dated academic datasets (take Yahoo! learning-to-rank challenge dataset as an example)
- Limited SERP Features and User Behaviors: Only position-based SERP features and click behavior are collected, which has led most ULTR algorithms to center their efforts around position-related bias alone.
- Outdated query-document relevance features lack incorporation of recent advancements in NLP, limiting their effectiveness for developing ranking algorithms that leverage enhanced semantic understanding for real-world applications.
Newly collected Baidu-ULTR dataset
- Comprehensive SERP features and user behaviors enable an in-depth analysis of multiple display biases.
- Updated relevance features use MonoBERT to encode query-document relevance, benefiting from advanced language modeling techniques for enhanced semantic understanding.
New challenges in the Baidu-ULTR dataset: Existing ULTR algorithms that primarily address position bias show no improvement over models trained directly on click data alone.
Solution: a ULTR algorithm that considers biases from the entire search page presentation
- Suitable User Behavior Model Design
  - Challenge: Many user behavior models are heavily position-focused and do not generalize to other SERP features.
  - Graph model: Represent relationships between user behavior and various SERP features using a directed graph, which allows the user behavior model to be described flexibly.
  - Graph Structure Learning to capture biases from diverse SERP features in a data-driven approach, adapting to the unique characteristics of each feature.
- Unbiased Learning Algorithm: Implement an algorithm to mitigate confounding biases by applying importance reweighting to the learned user behavior model.
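The importance reweighting idea can be illustrated with the classical inverse-propensity form. In this sketch the propensities are simply given as numbers, standing in for whatever the learned user behavior model would estimate from the page presentation; the documents, relevance values, and the clipping threshold are hypothetical.

```python
def ipw_relevance(click_rate, propensity, clip=0.05):
    """Divide the observed click rate by the examination propensity (clipped for stability)."""
    return click_rate / max(propensity, clip)

# Hypothetical setup: document B is more relevant but shown where users rarely look.
true_relevance = {"A": 0.4, "B": 0.8}
propensity = {"A": 1.0, "B": 0.3}  # assumed output of a user behavior model

# Expected click rate under the examination hypothesis: click = examined AND relevant.
click_rate = {doc: propensity[doc] * true_relevance[doc] for doc in true_relevance}

naive_ranking = sorted(click_rate, key=click_rate.get, reverse=True)
debiased = {doc: ipw_relevance(click_rate[doc], propensity[doc]) for doc in click_rate}
debiased_ranking = sorted(debiased, key=debiased.get, reverse=True)

print("ranking by raw clicks     :", naive_ranking)
print("ranking after reweighting :", debiased_ranking)
```

Ranking by raw clicks favors the prominently displayed document, while the reweighted estimate recovers the ordering by relevance. The quality of this correction depends on how well the propensities are estimated.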

(a) A demo explanation of rich page presentation information in Baidu-ULTR. There are 8 presentation features that start from D1 to D8. (b) A demo explanation of rich user behaviors in Baidu-ULTR. There are 18 user behaviors starting from U1 to U18.

[1] A Large Scale Search Dataset for Unbiased Learning to Rank, Haitao Mao et al., NeurIPS 2022

[2] Whole Page Unbiased Learning to Rank, Haitao Mao et al., WebConference 2024

2. Recommendation

2.1 Multiple-locale Multilingual Session Recommendation in Amazon

Problem Statement: Session Recommendation aims to predict the next item the user will interact with based on previous item interactions in a single anonymous session.
Outdated academic datasets:
- Limited text attributes: Existing session recommendation datasets often lack textual descriptions, which have already proven useful in recommendation, including (1) addressing the cold-start issue and facilitating cross-domain transfer, (2) generative retrieval with semantic IDs, and (3) conversational recommendation with LLMs.
- Small item sets lack real-world item diversity. Moreover, efficiency challenges are overlooked: many algorithms train with a cross-entropy loss over the entire item set, which is computationally intensive and impractical for real-world large item sets.
Newly collected Amazon-M2 Session Recommendation dataset
- 100 times larger item set with multiple locale diversity
- Rich Multilingual Text Attributes enables exploration of cross-lingual and cross-regional transferability, transferring knowledge from large locales to facilitate the recommendation for underrepresented locales.

[1] Amazon-M2: A Multilingual Multi-locale Shopping Session Dataset for Recommendation and Text Generation, Haitao Mao et al., NeurIPS 2023

2.2. Cross-domain Sequential Recommendation with Generative Retrieval

Ongoing project explores a novel approach for cross-domain recommendation by leveraging graph relationships. I’m excited to share more details once the paper is released early next year.

4. Accelerate and Stabilize Neural Network training through insights on neuron relationship

Neural networks, with their vast parameter scale, often suffer from slow and potentially unstable training. Reducing training resource requirements remains a key practical challenge. During my internship at Microsoft, I introduced a novel perspective centered on individual neurons, inspired by the insight that permutations of hidden neurons within the same layer leave the input-output mapping unchanged. By treating each neuron as a fundamental unit of analysis, I modified neuron-specific behaviors to improve both training efficiency and stability.

Permuting neuron ids does not change the input-output mapping
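This permutation symmetry is easy to verify numerically. The sketch below assumes a small two-layer ReLU network with randomly drawn weights; the shapes and names are illustrative only.

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    """Two-layer MLP: ReLU hidden layer followed by a linear output layer."""
    hidden = np.maximum(W1 @ x + b1, 0.0)
    return W2 @ hidden + b2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)  # 3 inputs -> 4 hidden neurons
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)  # 4 hidden neurons -> 2 outputs
x = rng.normal(size=3)

# Reorder the hidden neurons: permute rows of W1 and b1, and the matching columns of W2.
perm = rng.permutation(4)
out_original = mlp(x, W1, b1, W2, b2)
out_permuted = mlp(x, W1[perm], b1[perm], W2[:, perm], b2)

print(np.allclose(out_original, out_permuted))
```

Because the ordering of hidden neurons carries no meaning, each neuron can be analyzed or selected as an independent unit, which is the perspective used in the two works below.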

Neuron Campaign Strategy for Accelerating training procedure

Goal: An initialization that promotes model convergence, instead of only preventing training failure, e.g., gradient vanishing or explosion.
Neuron Campaign Strategy: initialize a model with primary discriminative ability
- Create a large candidate neuron set utilizing traditional initialization strategies like Xavier
- Select neurons initialized with primary discriminative ability
- Combine winning neurons as the neural network initialization.

Stabilize Neuron Response for better generalization

Analysis: Neurons with similar responses to instances of the same class lead to better generalization
Solution: Regularization term to minimize intra-class variance in neuron responses, thereby stabilizing neuron activation patterns.
Results: promotes a more stable training process, leading to faster convergence and improved generalization across tasks.
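A minimal sketch of such a regularization term is given below. It assumes the neuron responses of one layer are collected into a matrix with one row per instance, and it averages the per-class variance of every neuron; the exact weighting and normalization in the paper may differ, and the function name is hypothetical.

```python
import numpy as np

def intra_class_response_variance(responses, labels):
    """responses: (n_samples, n_neurons) activations; labels: (n_samples,) class ids."""
    responses = np.asarray(responses, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        class_responses = responses[labels == c]
        # Variance of each neuron within the class, summed over neurons,
        # weighted by the number of instances in the class.
        total += class_responses.var(axis=0).sum() * len(class_responses)
    return total / len(labels)

labels = np.array([0, 0, 1, 1])
steady = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
unsteady = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])

print("steady  :", intra_class_response_variance(steady, labels))
print("unsteady:", intra_class_response_variance(unsteady, labels))
```

In training, a term like this would be added to the task loss with a weighting coefficient, so that neurons are encouraged to respond consistently to instances of the same class.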

[1] Neuron Campaign for Initialization Guided by Information Bottleneck Theory, Haitao Mao et al., CIKM 2021 Best Paper Award

[2] Neuron with Steady Response Leads to Better Generalization, Haitao Mao et al., NeurIPS 2022

LLM Mechanism Analysis for more reliable LLM outputs

ICL Mechanism Analysis

In-Context Learning (ICL) serves as a fundamental emergent capability, underpinning a wide range of complicated abilities.
ICL learns from a few prompt examples, enabling downstream generalization without requiring gradient updates.
The mystery of how ICL succeeds:
- Unexplained phenomena: sensitivity to the order of ICL samples, robustness to wrong input-label mappings
- Unclear origin of ICL ability: influenced by pre-training data quality, model scale, and task difficulty
- Core question: Can LLMs really learn new skills from ICL examples?
Key analysis approach through the lens of data generation functions
- A skill is formulated as a data generation function
- Skill recognition refers to the ICL mechanism that selects one learned data generation function previously seen during pretraining
- Skill learning refers to the ICL mechanism that learns new data generation functions from ICL examples. The two are distinguished by whether LLMs can learn a new data generation function in context.
Suitable data generation functions enable
- Theoretical analysis through mathematical modeling with HMM, LDA, and other functions in the NLP domain
- Controllable empirical analysis on synthetic data generated by different functions

[1] A Data Generation Perspective to the Mechanism of In-Context Learning, Haitao Mao et al., 2024

Convergence guarantee on iteratively applying intrinsic self-correction

Intrinsic self-correction ability: LLMs can improve their responses when instructed with only the task’s goal without specific details about potential issues in the response
Example Instruction: “Please ensure that your answer is unbiased and does not rely on stereotypes”
Advantage: more efficient than other methods that require feedback from humans, tools, or more powerful LLMs.
Challenge: Intrinsic self-correction may not always be effective, as it has the potential to revise an initially correct response into an incorrect one
Mechanisms for Effective Self-Correction
- When a self-correction instruction is given
  - Positive concepts in the LLM are activated
  - This in turn reduces model uncertainty, which decreases and stabilizes the calibration error
  - This leads to converged self-correction performance with stable improvement
- Skill recognition explanation: multiple-round instructions locate the desired skill
Remaining Challenges: While skill recognition helps align LLM responses with calibrated abilities, it primarily taps into the model’s existing, pre-trained skills. Due to the inherent limitations of these capabilities, the improvements often remain superficial:
- Incomplete Correction: Self-correction struggles to fully eliminate undesirable content embedded in intermediate hidden states.
- Superficial Modification: LLMs typically add non-toxic text to the original response instead of genuinely revising problematic parts.

(1) Superficial: LLMs tend to append non-toxic text but do not modify previous responses. (2) Convergence: Instructions reduce model uncertainty and strengthen positive concepts, guiding LLMs to converge to less toxic outputs.

[1] Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis, Guanliang Liu, Haitao Mao et al., EMNLP 2024

[2] On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept, Haitao Mao et al., 2024

When do GNNs work and when not in Node Classification

2023-10-15T00:00:00+00:00


1. Introduction

A graph is a very basic data structure that we learned in the Data Structures and Algorithms course. It naturally represents each instance as a node, and each edge denotes a pairwise relationship. It can be a natural representation of arbitrary data. For instance, in the computer vision domain, an image can be viewed as a grid graph. In the natural language processing domain, a sentence can be viewed as a path graph. In AI4Science, graphs can be adapted to many scientific problems.

Graph Neural Networks (GNNs) were proposed to bring the strong capability of neural networks to graph-structured data. GNN architectures have found wide applicability across a myriad of contexts, with graph data drawn from diverse sources ranging from social networks, citation networks, transportation networks, and financial networks to chemical molecules. Nonetheless, there is no consistent winning solution across all datasets, owing to the varied concepts that these graphs encode. For instance, GCN may work well on particular social networks while falling short on molecule graphs because it cannot capture particular key patterns on the graph.

Motivated by this problem, in this blog we focus on the question: when do GNNs work and when do they not? In light of this question, we

provide a thorough understanding of the properties of graph datasets and
how GNNs work on datasets with different properties.

The insights gleaned from this understanding will serve as a catalyst for the advancement of model development for novel graph datasets, thereby fostering the wider adoption of GNNs in emerging applications.

Our typical findings are:

GNNs can actually do better: Homophily, nodes connect with similar ones, is not a necessity for the success of GNNs. GNNs can still work on various heterophily datasets (nodes connect with dissimilar ones). Paper details can be found in pdf.
GNN may actually be worse: GNNs may perform even worse than MLP in certain circumstances. Paper details can be found in pdf.

We will dive deep into both the underlying graph data mechanism and model mechanism in this blog and provide a full picture of Graph Neural Networks for node classification.

If you have any questions on the blog, feel free to send an email to haitaoma@msu.edu

2. Preliminaries

In this section, we will provide a brief introduction on task, model, and data properties we focus on and the main analysis tool we utilize.

2.1 Task: Semi-supervised Node Classification (SSNC)

The semi-Supervised Node Classification (SSNC) task is to predict the categories or labels of the unlabeled nodes based on the graph structure and the labels of the few labeled nodes. We normally use message propagation methods through the connections in the graph to make educated guesses about the labels of the unlabeled nodes. It has wide applications in inferring node attributes, social influence prediction, traffic prediction, air quality prediction, and so on.

2.2 Models: Graph Neural Networks (GNNs)

Graph neural networks learn node representations by aggregating and transforming information over the graph structure. There are different designs and architectures for the aggregation and transformation, which leads to different graph neural network models.

We will mainly introduce GCN, a fundamental yet representative model. For one particular node, GCN aggregates the transformed features from its neighbors and does the averaging process.

$\textbf{Graph Convolutional Network (GCN).}$

From a local perspective of node $i$, GCN’s work can be written as a feature averaging process:

\[\mathbf{h}_i = \frac{1}{d_i}\sum_{j \in \mathcal{N}(i)}\mathbf{Wx}_j\]

where $\mathbf{h}_i$ denotes the aggregated feature, $d_i$ denotes the degree of node $i$, and $\mathcal{N}(i)$ denotes the neighbors of node $i$, i.e., $d_i = \left| \mathcal{N}(i) \right|$. $\mathbf{W} \in \mathbb{R}^{l \times l}$ is a parameter matrix that transforms the features, while $\mathbf{x}_j$ denotes the initial feature of node $j$. Notably, the weight transformation step will not be the focus of our paper since it is general in deep learning. We mainly focus on the aggregation step. The key reason for the aggregation step is that people assume that neighborhood nodes are similar to the center node, which is called the homophily assumption. Therefore, aggregation can benefit from such similarity and achieve a smooth and discriminative representation.
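The feature averaging formula can be written directly with matrix operations. This is a minimal NumPy sketch of the simplified formula above, assuming a dense adjacency matrix without self-loops and no isolated nodes; the function name and toy inputs are illustrative, and it omits the normalization variants and nonlinearity used in full GCN implementations.

```python
import numpy as np

def gcn_mean_layer(adj, X, W):
    """Compute h_i = (1 / d_i) * sum_{j in N(i)} W x_j for every node i."""
    degree = adj.sum(axis=1, keepdims=True)  # d_i for each node
    transformed = X @ W.T                    # row j holds W x_j
    return (adj @ transformed) / degree      # average over the neighbors of each node

# Toy path graph 0 - 1 - 2 with 2-dimensional features and W = identity.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0]])
W = np.eye(2)

H = gcn_mean_layer(adj, X, W)
print(H)
```

With the identity transformation, each row of `H` is simply the average of the neighboring feature vectors, which makes the role of the aggregation step easy to inspect.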

2.3 Data properties: Homophily and Heterophily

Recent works reveal different graph properties, e.g., degree, the length of the shortest path, could influence the effectiveness of GNN. Among them, people recognize that homophily and heterophily are the most important properties, which are the key focus of this paper. People generally believe that the neighborhood nodes are similar to the center node, which is called homophily. Therefore, aggregation can benefit from neighborhoods to achieve a more smooth and discriminative representation.

Homophily. If all edges only connect nodes with the same label, then this property is called Homophily, and the graph is called a Homophilous graph.

In Fig.1, the number denotes the label, and different colors denote distinct features. It is shown that all nodes with similar features have edges connected, and also share the same label, illustrating a perfect homophily.

Fig.1 A Homophily Example.

Heterophily. If all edges only connect nodes with different labels, then this kind of attribute is called Heterophily and the graph is called a Heterophilous graph. Fig.2 below shows a heterophilous graph. In this toy example, each node with label 0(1) only connects nodes with label 1(0).

Fig.2 A Heterophily Example.

Graph Homophily Ratio.

Given a graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ and node label vector $y$, the edge homophily ratio is defined as the fraction of edges that connect nodes with the same labels. Formally, we have:_

\[h(\mathcal{G}, \{y_i; i \in \mathcal{V}\}) = \frac{1}{\left| \mathcal{E} \right| } \sum_{(j, k) \in \mathcal{E}} \mathbb{I}(y_j = y_k)\]

where $\left| \mathcal{E} \right|$ is the number of edges in the graph and $\mathbb{I}(\cdot)$ denotes the indicator function.

A graph is typically considered to be highly homophilous when $0.5 \le h(\cdot) \le 1$. On the other hand, a graph with a low edge homophily ratio ($0 \le h(\cdot) < 0.5$) is considered to be heterophilous.
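The edge homophily ratio can be computed directly from an edge list. The snippet below is a minimal sketch: the function name `edge_homophily_ratio` and the toy graph are illustrative assumptions, not code from the original work.

```python
from typing import Hashable, Iterable, Mapping, Tuple


def edge_homophily_ratio(
    edges: Iterable[Tuple[Hashable, Hashable]],
    labels: Mapping[Hashable, Hashable],
) -> float:
    """Fraction of edges whose two endpoints share the same label."""
    edge_list = list(edges)
    if not edge_list:
        raise ValueError("the graph has no edges, so the ratio is undefined")
    same_label = sum(1 for j, k in edge_list if labels[j] == labels[k])
    return same_label / len(edge_list)


# Hypothetical toy graph: nodes 0 and 1 have label 0, nodes 2 and 3 have label 1.
labels = {0: 0, 1: 0, 2: 1, 3: 1}
homophilous_edges = [(0, 1), (2, 3)]
heterophilous_edges = [(0, 2), (0, 3), (1, 2), (1, 3)]

print(edge_homophily_ratio(homophilous_edges, labels))    # 1.0
print(edge_homophily_ratio(heterophilous_edges, labels))  # 0.0
```

The two toy edge lists correspond to the perfectly homophilous and perfectly heterophilous cases, so they sit at the two ends of the $[0, 1]$ range.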

Node Homophily Ratio.

Node homophily ratio is defined as the proportion of a node’s neighbors sharing the same label as the node. It is formally defined as:

\[h_i = \frac{1}{d_i} \sum_{j \in \mathcal{N}(i)} \mathbb{I}(y_j = y_i)\]

where $\mathcal{N} (i)$ denotes the neighbor node set of $v_i$ and $d_i = | \mathcal{N}(i) |$ is the cardinality of this set.

Similarly, node $i$ is considered to be homophilic when $h_i \ge 0.5$, and is considered heterophilic otherwise. Moreover, this ratio can be easily extended to higher-order cases $h_{i}^{(k)}$ by considering $k$-order neighbors $\mathcal{N}_k(v_i)$.
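A small sketch of the node homophily ratio follows. The helper names (`build_neighbors`, `node_homophily_ratio`) and the star-shaped toy graph are assumptions made for illustration; the graph is treated as undirected.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Undirected adjacency: node -> set of neighboring nodes."""
    neighbors = defaultdict(set)
    for j, k in edges:
        neighbors[j].add(k)
        neighbors[k].add(j)
    return dict(neighbors)


def node_homophily_ratio(node, neighbors, labels):
    """h_i: share of node i's neighbors that carry the same label as i."""
    nbrs = neighbors.get(node, set())
    if not nbrs:
        return None  # d_i = 0, so h_i is undefined
    return sum(1 for j in nbrs if labels[j] == labels[node]) / len(nbrs)


# Hypothetical star graph centered at node 0.
labels = {0: 0, 1: 0, 2: 1, 3: 1}
neighbors = build_neighbors([(0, 1), (0, 2), (0, 3)])
ratios = {i: node_homophily_ratio(i, neighbors, labels) for i in labels}
print(ratios)
```

Here node 0 has one same-label neighbor out of three, so it counts as heterophilic under the $h_i \ge 0.5$ rule, while node 1 is homophilic.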

2.4 Target: What is a discriminative representation?

To examine whether GNNs perform well or not, we focus on whether a GNN can encode a discriminative representation. The ideal discriminative representation can be described by two properties: (1) Cohesion: nodes with the same label are mapped to similar representations. (2) Separation: nodes with different labels are mapped to dissimilar representations. Fig.3 below illustrates an example of high cohesion and good separation, where each color indicates one class. We can observe that each cluster contains a single class while different clusters are distant from each other. We can then expect a simple linear classifier to achieve high performance, which indicates an ideal representation.

Fig.3 A simple linear classifier.
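One way to make cohesion and separation concrete is to measure distances to class centroids. The metric below is an assumption chosen for illustration (mean distance to the own-class centroid for cohesion, smallest centroid-to-centroid distance for separation); the text above does not prescribe a specific measure.

```python
import math
from collections import defaultdict


def class_centroids(embeddings, labels):
    """Average embedding of each class."""
    grouped = defaultdict(list)
    for node, vec in embeddings.items():
        grouped[labels[node]].append(vec)
    return {
        c: tuple(sum(col) / len(vecs) for col in zip(*vecs))
        for c, vecs in grouped.items()
    }


def cohesion_and_separation(embeddings, labels):
    """Return (mean distance to own centroid, min distance between centroids).

    Lower cohesion and higher separation mean a more discriminative embedding.
    Requires at least two classes.
    """
    centroids = class_centroids(embeddings, labels)
    cohesion = sum(
        math.dist(vec, centroids[labels[node]]) for node, vec in embeddings.items()
    ) / len(embeddings)
    classes = list(centroids)
    separation = min(
        math.dist(centroids[a], centroids[b])
        for idx, a in enumerate(classes)
        for b in classes[idx + 1:]
    )
    return cohesion, separation


# Hypothetical 2-D embeddings forming two tight, distant clusters.
embeddings = {0: (0.0, 0.0), 1: (0.0, 2.0), 2: (10.0, 0.0), 3: (10.0, 2.0)}
labels = {0: "a", 1: "a", 2: "b", 3: "b"}
cohesion, separation = cohesion_and_separation(embeddings, labels)
print(cohesion, separation)  # 1.0 10.0
```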

3. GNN can actually do better?

In this section, we illustrate that GNNs can actually do better than commonly believed: homophily, where nodes connect with similar ones, is not a necessity for the success of GNNs. GNNs can still work on various heterophily datasets, where nodes connect with dissimilar ones. To show this, we focus on whether a GNN can achieve a discriminative representation in different settings.

3.1 Toy example & theoretical analysis

In this subsection, we examine when GCN can map nodes with the same label to similar embeddings. We first play with toy graph examples, homophily and heterophily graphs, which are shown in Section 2.3. In particular, we examine node representations from different classes after the GNN.

GCN under homophily: The aggregation process for the homophily graph is shown in Fig. 4, where node color and number represent node features and labels, respectively. We can easily observe that, after mean aggregation, all the nodes with class 1 are in blue and all the nodes with class 2 are in red, indicating a good discriminative ability.

Fig.4 GCN under homophily.

GCN under heterophily: The aggregation process for the heterophily graph is shown in Fig. 5, where node color and number represent node features and labels, respectively. We can easily observe a color alternation. Before aggregation, all the nodes with class 1 are in blue and all the nodes with class 2 are in red. After mean aggregation, in contrast, all the nodes with class 1 are in red and all the nodes with class 2 are in blue. Nonetheless, such alternation does not harm the discriminative ability: nodes with the same class still share the same color while nodes with different classes have different colors.

Fig.5 GCN under heterophily.
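The color alternation can be reproduced in a few lines. This is a minimal sketch under simplifying assumptions: plain mean aggregation over neighbors, no self-loops, and the weight matrix omitted. The one-hot "colors" and the four-node bipartite graph are hypothetical stand-ins for the figure.

```python
def mean_aggregate(features, neighbors):
    """Replace each node's feature with the mean of its neighbors' features."""
    aggregated = {}
    for i, nbrs in neighbors.items():
        dims = range(len(features[i]))
        aggregated[i] = tuple(
            sum(features[j][d] for j in nbrs) / len(nbrs) for d in dims
        )
    return aggregated


BLUE, RED = (1.0, 0.0), (0.0, 1.0)
# Nodes 0 and 1 belong to one class (blue), nodes 2 and 3 to the other (red).
features = {0: BLUE, 1: BLUE, 2: RED, 3: RED}
# Perfect heterophily: every edge connects the two classes.
heterophilous = {0: [2, 3], 1: [2, 3], 2: [0, 1], 3: [0, 1]}

after = mean_aggregate(features, heterophilous)
print(after[0], after[2])  # (0.0, 1.0) (1.0, 0.0)
```

The two classes swap colors, yet nodes of the same class still end up with identical features and the two classes remain distinct, which is what a linear classifier needs.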

More rigorously, we provide a theoretical understanding of what kinds of graphs could benefit from the GNNs and how. GNN can perform well on the graphs satisfying:

Feature: Nodes with the same label $y_i$ have features sampled from the same distribution $\mathcal{F}_{y_i}$.
Structure: Nodes with the same label $y_i$ follow the same neighborhood distribution $\mathcal{D}_{y_i}$. Homophily graphs fit this assumption, since nodes are more likely to be connected with nodes in the same class. Some heterophily graphs also fit this assumption. For instance, Fig. 5 shows that nodes in class 1 are connected with nodes in class 2 and vice versa.

The rigorous theoretical analysis is shown as follows. (You can skip the following part if you want to avoid heavy math!)

Consider a graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}, \{ \mathcal{F}_{c}, c \in \mathcal{C} \}, \{ \mathcal{D}_{c}, c \in \mathcal{C} \}\}$.

For any node $i\in \mathcal{V}$, the expectation of the pre-activation output of a single GCN operation is given by $\mathbb{E}[{\bf h}_i] = {\bf W}\left( \mathbb{E}_{c\sim \mathcal{D}_{y_i}, {\bf x}\sim \mathcal{F}_c } [{\bf x}]\right).$

and for any $t>0$, the probability that the distance between the observation ${\bf h}_i$ and its expectation is larger than $t$ is bounded by

\[\mathbb{P}\left( \|{\bf h}_i - \mathbb{E}[{\bf h}_i]\|_2 \geq t \right) \leq 2 \cdot l\cdot \exp \left(-\frac{ deg(i) t^{2}}{ 2\rho^2({\bf W}) B^2 l}\right)\]

where $l$ denotes the feature dimensionality and $\rho({\bf W})$ denotes the largest singular value of ${\bf W}$, $B\geq\max _{i, j}|\mathbf{X}[i, j]|$.

We can then draw the rigorous conclusion that the inner-class distance (the distance between ${\bf h}_i$ and its expectation $\mathbb{E}[{\bf h}_i]$, which is shared within the same class) on the GCN embedding is small with high probability, because the neighbors are sampled from the neighborhood distribution $\mathcal{D}_{y_i}$. Notably, the key step in the proof is the Hoeffding inequality. Details can be found in the paper.
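To get a feel for how the bound behaves, one can simply evaluate its right-hand side. The sketch below does only that; the function and argument names are illustrative assumptions, and the numbers are arbitrary rather than taken from any experiment.

```python
import math


def concentration_bound(t, degree, feature_dim, rho_w, b):
    """Right-hand side of the bound: 2 * l * exp(-deg(i) t^2 / (2 rho^2 B^2 l)).

    Values above 1 are vacuous, since a probability never exceeds 1.
    """
    exponent = -(degree * t ** 2) / (2 * rho_w ** 2 * b ** 2 * feature_dim)
    return 2 * feature_dim * math.exp(exponent)


# Arbitrary illustrative setting: l = 1, rho(W) = 1, B = 1, t = 1.
for degree in (2, 10, 50):
    print(degree, concentration_bound(1.0, degree, 1, 1.0, 1.0))
```

The bound decays exponentially in the degree, so high-degree nodes concentrate around their class expectation much more tightly than low-degree nodes.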

3.2 Empirical evidence

To further verify the validity of the theoretical results, we provide more empirical evidence as follows. In particular, we manually add synthetic edges to control the homophily ratio of a graph and examine how the performance varies.

When adding synthetic heterophily edges on a homophily graph, there are two typical things to control:

The homophily ratio: how many heterophily edges are added in a homophily graph
The noisy ratio $\gamma$: how many edges do not follow the same neighborhood distribution $\mathcal{D}_{y_i}$. If the noisy ratio $\gamma$ is larger, the graph will be away from the condition that GNN can work well, leading to a poor performance.
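The edge-insertion procedure can be sketched as follows. This is a rough sketch under stated assumptions, not the exact protocol of the original experiments: a source node is drawn uniformly, a clean edge draws the target class from a hypothetical per-class distribution `neighbor_dist` (playing the role of $\mathcal{D}_{y_i}$), and a noisy edge (probability $\gamma$) draws the target class uniformly at random.

```python
import random


def add_synthetic_edges(labels, neighbor_dist, num_edges, gamma, seed=0):
    """Sample `num_edges` new edges; a fraction of about `gamma` ignores D_{y_i}.

    Assumes the graph has at least two nodes; duplicate edges are not filtered.
    """
    rng = random.Random(seed)
    nodes = sorted(labels)
    classes = sorted(set(labels.values()))
    nodes_by_class = {c: [v for v in nodes if labels[v] == c] for c in classes}
    new_edges = []
    while len(new_edges) < num_edges:
        i = rng.choice(nodes)
        if rng.random() < gamma:
            # Noisy edge: the target class is picked uniformly at random.
            target_class = rng.choice(classes)
        else:
            # Clean edge: the target class follows the distribution of i's class.
            dist = neighbor_dist[labels[i]]
            target_class = rng.choices(list(dist), weights=list(dist.values()))[0]
        j = rng.choice(nodes_by_class[target_class])
        if j != i:  # skip self-loops
            new_edges.append((i, j))
    return new_edges


labels = {0: 0, 1: 0, 2: 1, 3: 1}
# Hypothetical heterophilous pattern: class 0 links to class 1 and vice versa.
neighbor_dist = {0: {1: 1.0}, 1: {0: 1.0}}
clean_edges = add_synthetic_edges(labels, neighbor_dist, num_edges=10, gamma=0.0)
print(clean_edges)
```

With $\gamma = 0$ every inserted edge is cross-class, so the homophily ratio drops while the neighborhood pattern of each class stays distinguishable.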

As we insert heterophilous edges, the graph homophily ratio will also continuously decrease. The results are plotted in Fig.6.

Fig.6 Accuracy of GCN on the synthetic graph with various homophily ratios.

Each point on the plot in Fig.6 represents the performance of GCN model and the corresponding value in the $x$-axis denotes the homophily ratio. The point with homophily ratio $h=0.81$ denotes the original $Cora$ graph, i.e., $K=0$.

The observations are shown as follows:

When $\gamma=0$, all edges are inserted according to the distinguished neighborhood distribution, and we observe the classification performance shows a $V$-shape pattern. This clearly demonstrates that the GCN model can work well on heterophilous graphs under certain conditions.
When $\gamma>0$, if the noise level $\gamma$ is not so large, we can still observe the $V$-shape: e.g. $\gamma = 0.4$; this is because the designed pattern is not totally dominated by the noise. However, when $\gamma$ is larger, adding edges will constantly decrease the performance, as nodes of different classes have indistinguishably similar neighborhoods.

The experiment verifies our findings. If the neighborhood follows a similar distribution, GCN is still able to perform well under extreme heterophily. However, if we introduce noise to the neighborhood distribution, the effectiveness of GCN will not be guaranteed.

4. GNN may actually do worse

In section 3, we discuss the scenario when GNN can do well, including both homophily and heterophily graphs. All the above analyses are from a graph (global) perspective, verifying that the GNN can achieve overall performance gain. However, when we look closer into a node (local) perspective, we find the overlooked vulnerability of GNNs.

4.1 Preliminary study in node level

Instead of understanding from a graph perspective, the following analyses focus on nodes in the same graph but with different properties. We first plot the distribution of the node homophily ratio on different datasets, shown in Fig.7. We include two homophily graphs and two heterophily ones. Additional results on ten different datasets can be found in the original paper. The $h$ in brackets indicates the graph homophily ratio, and $h_{node}$ on the $x$-axis denotes the node homophily ratio. We can clearly observe that:

Regardless of global homophily and heterophily, there are both homophily nodes and heterophily nodes.
In the homophily graph, most of the nodes are homophily nodes, so we consider the homophily pattern to be the majority pattern in the homophily graph. In contrast, the heterophily pattern is the minority pattern.

Fig.7 Node homophily ratio distributions. All graphs exhibit a mixture of homophilic and heterophilic nodes despite various graph homophily ratios h.

Equipped with the analysis of node-level data patterns, we then investigate how GNN performs on nodes with different patterns. In particular, we compare GCN with MLP-based models since they only take the node features as the input, ignoring the structural patterns. If GCN performs worse than MLP, it indicates the vulnerability of GNNs. Experimental results are illustrated in Fig. 8.

Fig.8 Performance comparison between GCN and MLP-based models. Each bar represents the accuracy gap (MLP-based model minus GCN).

We can observe that:

In the homophily graph, GCN works better on the homophily nodes but underperforms on the heterophily nodes.
In the heterophily graph, GCN works better on the heterophily nodes but underperforms on the homophily nodes.

Overall, GNNs work well on nodes in the majority pattern but fail on nodes in the minority pattern. We then focus on investigating how such performance disparity arises.

4.2 Toy example & theoretical analysis

Similar to Section 3.1, we first conduct an analysis on a similar toy example. This time, instead of considering GNNs under homophily and heterophily separately, we take the homophily and heterophily patterns into consideration together. The illustration is shown in Fig. 9, where node color and number represent node features and labels, respectively.

Fig.9 toy example on both homophily and heterophily patterns.

We can observe that when considering the homophily and heterophily together:

Before aggregation: nodes in class 0 are all in the blue feature, while nodes in class 1 are all in the red feature.
After aggregation: nodes in class 0 are with both blue and red, and a similar thing in class 1. It indicates a large intra-class difference.
After aggregation: There are some nodes with the feature blue in class 0 and some in class 1. It could be impossible to distinguish them with such small inter-class differences.

The above observations on the toy example show that a GNN cannot work well on homophily and heterophily patterns at the same time. We then further ask which of the two patterns a GNN learns well. The answer is the majority pattern in the training set.

Motivated by the toy example, we then provide a rigorous theoretical understanding from a node-level perspective. We find that two key factors for test performance are:

The aggregated feature distance $\epsilon_m = \max_{ j \in V_m } \min_{ i \in V_{ \text{tr} } } \| g_i(X, G) - g_j(X, G) \|_2$ between test node subgroup $V_m$ and training nodes $V_{\text{tr}}$, where $g_i(X, G)$ is the hidden representation for node $i$.
The homophily ratio difference $|h_\text{tr} - h_m|$.
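Both quantities are easy to compute once the aggregated features and node homophily ratios are available. The sketch below assumes that $h_\text{tr}$ and $h_m$ are the mean node homophily ratios of the training nodes and of the test subgroup; that reading, the function names, and the toy numbers are assumptions for illustration.

```python
import math


def aggregated_feature_distance(test_feats, train_feats):
    """Max over test nodes of the distance to the closest training node."""
    return max(
        min(math.dist(g_j, g_i) for g_i in train_feats) for g_j in test_feats
    )


def homophily_ratio_difference(train_ratios, test_ratios):
    """|h_tr - h_m| with each h taken as the mean node homophily ratio."""
    h_tr = sum(train_ratios) / len(train_ratios)
    h_m = sum(test_ratios) / len(test_ratios)
    return abs(h_tr - h_m)


# Hypothetical aggregated features g_i(X, G) and node homophily ratios.
train_feats = [(1.0, 0.0), (4.0, 0.0)]
test_feats = [(0.0, 0.0), (5.0, 0.0)]
eps_m = aggregated_feature_distance(test_feats, train_feats)
h_gap = homophily_ratio_difference([0.9, 0.7], [0.2, 0.4])
print(eps_m, h_gap)
```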

The following theorem is based on the PAC-Bayes analysis, showing that both large aggregation distance and homophily ratio difference between train and test nodes lead to worse performance. (You can skip the following part for heavy math!)

The theory aims to bound the generalization gap between the expected margin loss $\mathcal{L}_{m}^{0}$ on test subgroup $V_m$ for a margin $0$ and the empirical margin loss $\hat{\mathcal{L}}_{\text{tr}}^{\gamma}$ on train subgroup $V_{\text{tr}}$ for a margin $\gamma$. Those losses are commonly used in PAC-Bayes analysis. The formulation is shown as follows:

Theorem (Subgroup Generalization Bound for GNNs):

Let $\tilde{h}$ be any classifier in the classifier family $\mathcal{H}$ with parameters $\{ \tilde{W}_{l} \}_{l=1}^{L}$.

For any $0< m \le M$, $\gamma \ge 0$, and a large enough number of training nodes $N_{\text{tr}}=|V_{\text{tr}}|$, there exists $0<\alpha<\frac{1}{4}$ such that, with probability at least $1-\delta$ over the sample of $y^{\text{tr}} = \{ y_i \}_{i \in V_{\text{tr}}}$, we have:

\[\mathcal{L}_m^0(\tilde{h}) \le \hat{\mathcal{L}}_\text{tr}^{\gamma}(\tilde{h}) + O\left( \underbrace{\frac{K\rho}{\sqrt{2\pi}\sigma} (\epsilon_m + |h_\text{tr} - h_m|\cdot \rho)}_{\textbf{(a)}} + \underbrace{\frac{b\sum_{l=1}^L\|\tilde{W}_l\|_F^2}{(\gamma/8)^{2/L}N_\text{tr}^{\alpha}}(\epsilon_m)^{2/L}}_{\textbf{(b)}} + \mathbf{R} \right)\]

The bound is related to three terms:

(a) shows that both a large homophily ratio difference $|h_{\text{tr}} - h_m|$ and a large aggregated feature distance $\epsilon_m = \max_{j\in V_m}\min_{i\in V_{\text{tr}}} \|g_i(X, G)-g_j(X, G)\|_2$ between test node subgroup $V_m$ and training nodes $V_{\text{tr}}$ lead to a large generalization error. $\rho= \|\mu_1 - \mu_2 \|$ denotes the original feature separability, independent of structure. $K$ is the number of classes.

(b) further strengthens the effect of nodes with the aggregated feature distance $\epsilon$, leading to a large generalization error.

(c) $\mathbf{R}$ is a term independent of the aggregated feature distance and the homophily ratio difference, given by $\frac{1}{N_\text{tr}^{1-2\alpha}} + \frac{1}{N_\text{tr}^{2\alpha}} \ln\frac{LC(2B_m)^{1/L}}{\gamma^{1/L}\delta}$, where $B_m= \max_{i\in V_\text{tr}\cup V_m}\|g_i(X,G)\|_2$ is the maximum feature norm. $\mathbf{R}$ vanishes as the training size $N_\text{tr}$ grows.

Our theory suggests that both homophily ratio difference and aggregated feature distance to training nodes are key factors contributing to the performance disparity. Typically, nodes with large homophily ratio differences and aggregated feature distance to training nodes lead to performance degradation.

4.3 Empirical evidence

To further verify the validity of the theoretical results, we provide more empirical evidence of the performance disparity. In particular, we compare the performance of different node subgroups divided by both the homophily ratio difference and the aggregated feature distance to training nodes. For a test node $u$, we measure the node disparity by

The aggregated feature distance to the closest training node $v = \arg\min_{v' \in V_{\text{tr}}} \|\mathbf{F}^{(2)}_u-\mathbf{F}^{(2)}_{v'}\|$, i.e., $s_1 = \|\mathbf{F}^{(2)}_u-\mathbf{F}^{(2)}_v\|$.
The homophily ratio difference $s_2 = |h^{(2)}_u - h^{(2)}_v|$.

We then sort test nodes in terms of $s_1$ and $s_2$ and divide them into 5 equal-binned subgroups accordingly. We include popular GNN models including GCN, SGC (Simplified Graph Convolution), GAT (Graph Attention Network), GCNII (Graph Convolutional Network via Initial residual and Identity mapping), and GPRGNN (Generalized PageRank Graph Neural Network). The performance of different node subgroups is presented in Fig.9. We note a clear test accuracy degradation with respect to the increasing differences in aggregated features and homophily ratios.

Fig.9 Test accuracy disparity across node subgroups by aggregated feature distance and homophily ratio difference to training nodes. Each figure corresponds to a dataset, and each bar cluster corresponds to a GNN model. A clear performance decrease tendency can be found from subgroups 1 to 5 with increasing differences to training nodes.
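The subgroup evaluation can be sketched as below. Two points are assumptions, since the text does not spell them out: the two scores are combined by a plain sum $s_1 + s_2$, and "equal-binned" is read as equal-sized bins. The function name and the toy data are hypothetical.

```python
def subgroup_accuracy(scores, correct, num_groups=5):
    """Sort test nodes by disparity score and report accuracy per subgroup.

    scores:  node -> disparity score (here assumed to be s_1 + s_2)
    correct: node -> True if the model classified the node correctly
    """
    order = sorted(scores, key=scores.get)  # ascending disparity
    n = len(order)
    groups = [
        order[g * n // num_groups:(g + 1) * n // num_groups]
        for g in range(num_groups)
    ]
    return [
        sum(correct[u] for u in group) / len(group) if group else None
        for group in groups
    ]


# Hypothetical data: ten test nodes, the five closest to training nodes are correct.
scores = {u: float(u) for u in range(10)}
correct = {u: u < 5 for u in range(10)}
print(subgroup_accuracy(scores, correct))  # [1.0, 1.0, 0.5, 0.0, 0.0]
```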

We then conduct an ablation study that considers only the aggregated feature distance and only the homophily ratio difference, in Figures 10 and 11, respectively. We can observe that the decreasing tendency disappears on many datasets. Only the combination of the two factors provides a comprehensive and accurate understanding of the reason for GNN performance disparity.

Fig.10 Test accuracy disparity across node subgroups by aggregated-feature distance to train nodes. A clear performance decrease tendency can be found from subgroups 1 to 5 with increasing differences to training nodes

Fig.11 Test accuracy disparity across node subgroups by homophily ratio difference to train nodes. A clear performance decrease tendency can be found from subgroups 1 to 5 with increasing differences to training nodes.

5. Applications

Inspired by the findings, we investigate the effectiveness of deeper GNN models on semi-supervised node classification (SSNC) tasks.

Deeper GNNs enable each node to capture more complex higher-order graph structure than vanilla GCN by alleviating the over-smoothing problem. Deeper GNNs empirically exhibit overall performance improvement. Nonetheless, on which structural patterns deeper GNNs excel, and the reason for their effectiveness, remain unclear.

To investigate this problem, we compare vanilla GCN with different deeper GNNs, including GPRGNN, APPNP, and GCNII, on node subgroups with varying homophily ratios. Experimental results are shown in Fig.11. We can observe that deeper GNNs primarily surpass GCN on minority node subgroups, with slight performance trade-offs on the majority node subgroups. We conclude that the effectiveness of deeper GNNs mainly stems from improved discriminative ability on minority nodes.

Fig.11 Performance comparison between GCN and deeper GNNs. Each bar represents the accuracy gap on a specific node subgroup exhibiting a homophily ratio within the range specified on the x-axis.

Having identified where deeper GNNs excel, the reasons why their effectiveness primarily appears in the minority node group remain elusive. Since the superiority of deeper GNNs stems from capturing higher-order information, we further investigate how the higher-order homophily ratio difference, denoted as $|h_u^{(k)}-h_v^{(k)}|$, varies on the minority nodes, where node $u$ is the test node and node $v$ is the closest training node to $u$. We concentrate on the minority nodes $V_{\text{mi}}$, identified by the default one-hop homophily ratio $h_u$, and examine how $\sum_{u\in V_{\text{mi}}} |h_u^{(k)}-h_v^{(k)}|$ varies with the order $k$.
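A higher-order homophily ratio can be computed with a breadth-first search. The sketch assumes that the $k$-order neighbors of a node are the nodes at shortest-path distance exactly $k$; the text does not fix this choice, and the helper names and the path graph are illustrative.

```python
from collections import deque


def k_hop_neighbors(node, neighbors, k):
    """Nodes whose shortest-path distance to `node` is exactly k."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # no need to expand beyond k hops
        for v in neighbors.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return {v for v, d in dist.items() if d == k}


def k_hop_homophily(node, neighbors, labels, k):
    """h_i^(k): share of k-hop neighbors with the same label as the node."""
    nbrs = k_hop_neighbors(node, neighbors, k)
    if not nbrs:
        return None
    return sum(1 for j in nbrs if labels[j] == labels[node]) / len(nbrs)


# Hypothetical path graph 0 - 1 - 2 - 3 with alternating labels.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
labels = {0: 0, 1: 1, 2: 0, 3: 1}
print([k_hop_homophily(0, neighbors, labels, k) for k in (1, 2, 3)])  # [0.0, 1.0, 0.0]
```

The toy path shows why the order matters: node 0 looks fully heterophilic at one hop but fully homophilic at two hops.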

Experimental results are shown in Fig.12, where a decreasing trend of homophily ratio difference is observed along with more neighborhood hops. The smaller homophily ratio difference leads to smaller generalization errors with better performance.

Fig.12 Multiple hop homophily ratio differences between training and minority test nodes

6. Conclusion & suggestions & future work

In this blog, we investigate when GNNs work and when they do not. We find that the effectiveness of vanilla GCN is not limited to homophily graphs. Nonetheless, a vulnerability is hidden under the success of GNNs. We provide some suggestions to consider before you build your own solution to a graph problem:

Before starting, think carefully about your data properties.
Remember the drawbacks of GNNs. They are not always good!

We leave some questions for future work:

Solve the drawbacks of GNNs
Inspire new principled applications of GNNs

References

[1] Ma, Yao and Jiliang Tang. “Deep learning on graphs.” Cambridge University Press, 2021.

[2] Ma, Yao, Xiaorui Liu, Neil Shah, and Jiliang Tang. “Is homophily a necessity for graph neural networks?” arXiv preprint arXiv:2106.06134 (2021).

[3] Mao, Haitao, Zhikai Chen, Wei Jin, Haoyu Han, Yao Ma, Tong Zhao, Neil Shah, and Jiliang Tang. “Demystifying Structural Disparity in Graph Neural Networks: Can One Size Fit All?” arXiv preprint arXiv:2306.01323 (2023).

[4] Kipf, Thomas N., and Max Welling. “Semi-supervised classification with graph convolutional networks.” arXiv preprint arXiv:1609.02907 (2016).

[5] Hamilton, Will, Zhitao Ying, and Jure Leskovec. “Inductive representation learning on large graphs.” Advances in neural information processing systems 30 (2017).

[6] Xu, Keyulu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. “How powerful are graph neural networks?” arXiv preprint arXiv:1810.00826 (2018).

[7] Fan, Wenqi, Yao Ma, Qing Li, Yuan He, Eric Zhao, Jiliang Tang, and Dawei Yin. “Graph neural networks for social recommendation.” In The world wide web conference, pp. 417-426. 2019.

[8] Zhu, Jiong, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra. “Beyond homophily in graph neural networks: Current limitations and effective designs.” Advances in neural information processing systems 33 (2020): 7793-7804.

Exploring the Potential of Large Language Models (LLMs) in Learning on Graphs

2023-08-04T00:00:00+00:00

Paper: https://arxiv.org/abs/2307.03393
Code: https://github.com/CurryTang/Graph-LLM

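Graphs are an important form of structured data with a wide range of applications. In the real world, the nodes of a graph are often associated with attributes in text form. Take the product graph in an e-commerce scenario (the OGBN-Products dataset) as an example: each node represents a product on an e-commerce website, and the product description serves as the attribute of the node. In graph learning, graphs whose node attributes are text are commonly called text-attributed graphs (TAGs). TAGs are very common in graph machine learning research; for instance, the most widely used citation datasets in graph learning are all TAGs. Besides the structural information of the graph itself, the text attributes of the nodes also provide important textual information, so one needs to consider the structural information, the textual information, and the interplay between the two. However, previous research has often overlooked the importance of the textual information. For example, the common datasets shipped with popular libraries such as PyG and DGL (e.g., the classic Cora dataset) do not provide the raw text attributes and only provide bag-of-words features in embedded form. Commonly used GNNs focus more on modeling the topology of the graph and lack an understanding of the node attributes.

Compared with previous work, this paper mainly studies how to better handle the textual information, and how different text embeddings, once combined with GNNs, affect downstream task performance. For handling text, the most popular tool today is undoubtedly the large language model (LLM). (This paper considers language models pre-trained on large-scale corpora, ranging from BERT to GPT-4, and uses LLM to refer to all of them.) Compared with bag-of-words text features such as TF-IDF, LLMs have the following potential advantages.

First, LLMs are context-aware and can better handle polysemous words, i.e., words with the same form but different meanings.
Second, through pre-training on large-scale corpora, LLMs are generally believed to have stronger semantic understanding, as reflected by their strong performance on various NLP tasks.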

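Considering the diversity of LLMs, the goal of this paper is to design suitable frameworks for different kinds of LLMs. With the integration of LLMs and GNNs in mind, this paper first categorizes LLMs into embedding-visible and embedding-invisible ones. LLMs such as ChatGPT, which can only be accessed through an interface, belong to the latter. For embedding-visible LLMs, this paper then considers three paradigms:

Encoder-based pre-trained language models represented by BERT. These models generally need to be fine-tuned on downstream data.
Sentence embedding models represented by Sentence-BERT. These models are usually obtained by further supervised or unsupervised training on top of the first kind of models, and do not need fine-tuning on downstream data. This paper also considers commercial embedding models represented by OpenAI's text-ada-embedding.
Open-source decoder-only large models represented by LLaMA. These models generally have far more parameters than the first kind. Considering the cost of fine-tuning and the existence of catastrophic forgetting, this paper mainly evaluates the base models without fine-tuning.

For these embedding-visible models, one can first use them to generate text embeddings and then use the text embeddings as the initial features of a GNN, thereby combining the two kinds of models. However, for embedding-invisible LLMs such as ChatGPT, how to apply their powerful capabilities to graph learning tasks becomes a challenge.

To address these problems, this paper proposes a framework for applying LLMs to graph learning tasks, as shown in Figure 1 and Figure 2 below. The first pipeline, LLMs-as-Enhancers, uses the capability of LLMs to enhance the original node attributes, which are then fed into a GNN to improve downstream performance. For embedding-visible LLMs, we adopt feature-level enhancement and combine the language model with the GNN through either a cascading or an iterative (GLEM, ICLR 2023) optimization scheme. For embedding-invisible LLMs, we adopt text-level enhancement, using the LLM to augment the original node attributes. Considering the zero-shot learning and reasoning abilities of LLMs such as ChatGPT, this paper further explores representing node attributes and structure in the form of prompts and letting the LLM generate predictions directly; we call this paradigm LLMs-as-Predictors. In the experiments, this paper mainly studies node classification; we discuss the limitations of this choice, and the possibility of extending to other tasks, at the end. Next, following the structure of the paper, we briefly share the interesting findings under each pipeline.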

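Feature enhancement with LLMs: LLMs-as-Enhancers

First, this paper studies the pipeline in which LLMs generate text embeddings that are then fed into GNNs. Under this pipeline, depending on whether the LLM is embedding-visible, we propose feature-level enhancement and text-level enhancement. For feature-level enhancement, we further consider the optimization process between the language model and the GNN, and subdivide it into the cascading structure and the iterative structure. The two enhancement methods are introduced below.

Feature-level enhancement

For feature-level enhancement, this paper mainly considers three factors: the language model, the GNN, and the optimization method. In terms of language models, we consider pre-trained language models represented by Deberta, open-source sentence embedding models represented by Sentence-BERT, commercial embedding models represented by text-ada-embedding-002, and open-source large models represented by LLaMA. For these language models, we mainly examine how the type of model and its parameter scale affect downstream tasks.

From the perspective of GNNs, this paper mainly considers how the message-passing mechanism in GNN design affects downstream tasks. We mainly select GCN, SAGE, and GAT as three representative models; for the OGB datasets, we select RevGAT and SAGN, which rank near the top of the current leaderboards. We also include the performance of an MLP to examine the downstream performance of the raw embeddings.

From the perspective of optimization, this paper mainly examines the cascading structure and the iterative structure. For the cascading structure, we consider outputting text embeddings directly from the language model. For models small enough to be fine-tuned, we consider text-based fine-tuning and structure-based self-supervised training (GIANT, ICLR 2022). Either way, we end up with a language model that is then used to generate text embeddings. In this process, the language model and the GNN are trained separately. For the iterative structure, we mainly examine GLEM (ICLR 2023), which uses EM and variational inference to jointly train the GNN and the language model in an iterative manner.

In the experiments, we select several representative and commonly used TAG datasets; the detailed experimental settings can be found in our paper. Next, we first present the experimental results of this part (due to limited space, only the results on the two large graphs are shown here), and then briefly discuss some interesting findings.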

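The experimental results lead to the following interesting conclusions.

First, GNNs differ markedly in how effective they are on different text embeddings. A particularly clear example occurs on the Products dataset: with an MLP as the classifier, the embeddings from the fine-tuned pre-trained language model Deberta-base perform much better than TF-IDF. However, once a GNN is used, the gap becomes very small, and with SAGN in particular TF-IDF performs even better. This phenomenon may be related to over-smoothing and over-correlation in GNNs, but there is no complete explanation yet, which makes it an interesting research topic.

Second, using a sentence embedding model as the encoder and cascading it with a GNN yields very good downstream performance. On Arxiv in particular, simply cascading Sentence-BERT with RevGAT achieves performance close to GLEM and even surpasses GIANT, which uses self-supervised training. Note that this is not because a language model with more parameters is used: the Sentence-BERT here is the MiniLM version, which has even fewer parameters than the BERT used by GIANT. One possible reason is that Sentence-BERT, trained on the Natural Language Inference (NLI) task, provides implicit structural information; formally, NLI also bears some resemblance to link prediction. Of course, this is only a very preliminary conjecture, and a firm conclusion requires further study. This result also offers some inspiration: when considering pre-trained models on graphs, could one directly pre-train a language model and, with the more mature solutions for language model pre-training, obtain even better results than pre-training a GNN? Meanwhile, the paid embedding model provided by OpenAI brings only a small improvement over open-source models on node classification.

Third, compared with Deberta without fine-tuning, LLaMA achieves better results, but there is still a considerable gap to sentence embedding models. This suggests that the type of model may matter more than its parameter size. For Deberta, we use [CLS] as the sentence vector. For LLaMA, we use llama-cpp-embedding in langchain, whose implementation uses [EOS] as the sentence vector. Previous studies have explained why [CLS] performs poorly without fine-tuning: it is mainly due to its anisotropy, which leads to poor separability. In our experiments, at a high labeling rate the text embeddings generated by LLaMA achieve decent downstream performance, which indirectly suggests that increasing the number of parameters may alleviate this problem to some extent.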

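Text-level enhancement

Feature-level enhancement yields some interesting results. However, it still requires the language model to be embedding-visible. For embedding-invisible models such as ChatGPT, text-level enhancement can be used. For this part, we first study a recent arXiv paper, Explanations as Features (TAPE), whose idea is to use the explanations that the LLM generates for its predictions as enhanced attributes; with an ensemble, it ranked first on the OGB Arxiv leaderboard. In addition, this paper proposes Knowledge-Enhanced Augmentation (KEA), a method that uses LLMs for knowledge enhancement. Its core idea is to treat the LLM as a knowledge base, extract the knowledge-related key information from the text, and then generate more detailed explanations, mainly to compensate for the limited knowledge of language models with fewer parameters. Schematic diagrams of the two methods are shown below.

To test the effectiveness of the two methods, we follow the experimental settings of the first part. Considering the cost of using LLMs, we run experiments on two small graphs, Cora and Pubmed. For the LLM, we choose gpt-3.5-turbo, better known as ChatGPT. First, to better understand how to perform text-level enhancement and how effective TAPE is, we conduct a detailed ablation study of TAPE.

In the ablation study, we mainly consider the following questions:

Does the effectiveness of TAPE mainly come from the generated explanations E or from the pseudo labels P?
Which language model is the most suitable for encoding the enhanced attributes?

The results show that the pseudo labels depend heavily on the zero-shot prediction ability of the LLM itself (discussed in detail in the next section), and at low labeling rates they may even hurt the ensemble performance. Therefore, in the subsequent experiments we only use the original attributes TA and the explanations E. Second, sentence encoders achieve better results than fine-tuned pre-trained models at low labeling rates, so we adopt the sentence encoder e5. In addition, an interesting phenomenon is that on Pubmed, fine-tuning-based methods achieve very good performance once the enhanced features are used. One possible explanation is that the model mainly learns a "shortcut" to the LLM's predictions, so the performance of TAPE is highly correlated with the prediction accuracy of the LLM itself. Next, we compare the effectiveness of TAPE and KEA.

In the results, both KEA and TAPE improve over the original features. KEA performs better on Cora, while TAPE is more effective on Pubmed. As the next section shows, this is related to the fact that the LLM already predicts well on Pubmed. Compared with TAPE, KEA does not rely on the LLM's predictions, so its performance is more stable across datasets. Beyond these two datasets, text-level enhancement has more application scenarios. Smaller pre-trained language models such as BERT or T5 usually lack ChatGPT-level reasoning ability, and they cannot understand content from other domains, such as code or formatted text, as well as ChatGPT does. Therefore, for problems involving such scenarios, a large model like ChatGPT can be used to transform the original content. Training a smaller model on the transformed data then gives faster inference and lower inference cost. Meanwhile, if a certain amount of labeled data is available, fine-tuning captures dataset-specific information better than in-context learning.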

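Prediction with LLMs: LLMs-as-Predictors

In this part, we further consider whether the GNN can be dropped altogether, letting the LLM produce effective predictions through prompt design. Since this paper mainly considers node classification, a simple baseline is to treat node classification as text classification. Based on this idea, we first design some simple prompts to test how well LLMs perform without using any graph structure. We mainly consider the zero-shot and few-shot settings, and also test the effect of chain-of-thought prompting.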
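Treating node classification as text classification comes down to assembling a prompt from the node text and the candidate labels. The template below is purely hypothetical: it is not the prompt used in the paper, and the function names and wording are assumptions for illustration. It only shows how zero-shot and few-shot prompts differ structurally.

```python
def build_zero_shot_prompt(text, categories):
    """Hypothetical zero-shot template: node text plus the candidate labels."""
    options = ", ".join(categories)
    return (
        f"Paper: {text}\n"
        f"Task: classify the paper into one of the following categories: {options}.\n"
        "Answer with the category name only.\n"
        "Answer:"
    )


def build_few_shot_prompt(text, categories, examples):
    """Prepend labeled (text, label) demonstrations to the zero-shot prompt."""
    demos = "".join(f"Paper: {t}\nAnswer: {y}\n\n" for t, y in examples)
    return demos + build_zero_shot_prompt(text, categories)


# Made-up inputs for illustration only.
categories = ["Theory", "Neural Networks", "Reinforcement Learning"]
examples = [("A study of policy gradients.", "Reinforcement Learning")]
prompt = build_few_shot_prompt("Convergence of deep nets.", categories, examples)
print(prompt)
```

No graph structure enters the prompt here, which matches the structure-free baseline described above.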

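The experimental results are shown in the figure below. The performance of LLMs varies greatly across datasets. On Pubmed, the zero-shot performance of the LLM even exceeds that of GNNs, whereas on Cora, Arxiv, and other datasets there is a large gap to GNNs. Note that for the GNNs here, 20 samples per class are selected as the training set on Cora, CiteSeer, and Pubmed, while Arxiv and Products have more training samples. In contrast, the LLM predictions are zero-shot or few-shot, whereas GNNs have no zero-shot ability and also perform poorly in the few-shot setting. Of course, the limit on input length also prevents LLMs from including more in-context examples.

An analysis of the results shows that in some cases the wrong predictions of the LLM are quite reasonable. An example is shown in Figure 12. Many papers are themselves interdisciplinary, so when the LLM reasons with its own commonsense knowledge, its prediction sometimes fails to match the preference of the annotation. This raises a question worth considering: is the single-label setting reasonable?

In addition, LLMs perform worst on Arxiv, which is inconsistent with the conclusion in TAPE, so it is necessary to compare how the two prompts differ. The prompt used by TAPE is shown below.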

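Abstract: \n Title: \n Question: Which arXiv CS sub-category does this paper belong to? Give 5 likely arXiv CS sub-categories as a comma-separated list ordered from most to least likely, in the form “cs.XX”, and provide your reasoning. \n \n Answer: </blockquote>
Interestingly, TAPE does not even specify in the prompt which categories exist in the dataset; it directly relies on the knowledge about arXiv stored in the LLM. Strangely, this small change leads to a huge difference in the LLM's prediction performance, which makes one suspect test-set label leakage. As a high-quality corpus, arXiv data is very likely included in the pre-training of various LLMs, and TAPE's prompt may help the LLM better recall this pre-training corpus. This reminds us to rethink the validity of the evaluation, because the accuracy here may reflect neither the quality of the prompt nor the ability of the language model, but merely the LLM's memorization. Both of the above issues relate to dataset evaluation and are valuable future directions.
Furthermore, this paper also considers whether structural information can be included in the prompt in text form. We tested several ways of representing structural information in the prompt. Specifically, we tried expressing edge relations with the natural-language word "connect" and implicitly expressing edge relations by summarizing the information of neighboring nodes. The results show that the following implicit form is the most effective.
<blockquote> Paper:<paper content> NeighborSummary:<Neighborsummary> Instruction:<Task instruction> </blockquote>
Specifically, mimicking the idea of GNNs, we sample two-hop neighbor nodes, feed their text into the LLM, and ask it to produce a summary that serves as the structure-related information. An example is shown in Figure 13.
<a href="https://imgse.com/i/pPp9GNV"></a>
We tested the effectiveness of this prompt on several datasets; the results are shown in Figure 14. On the four datasets other than Pubmed, it brings some improvement over the setting that ignores structure, which reflects the effectiveness of the method. We further analyze why this prompt fails on Pubmed.
<a href="https://imgse.com/i/pPp917q"></a>
On Pubmed, the label of a sample often appears directly in its text attributes. An example is shown below. Because of this property, good results on Pubmed can be obtained by learning this "shortcut", and the particularly strong performance of LLMs on this dataset may stem from exactly this. In this case, adding the summarized neighbor information may make it harder for the LLM to capture the shortcut, so performance drops.
<blockquote> Title: Predictive power of sequential measures of albuminuria for progression to ESRD or death in Pima Indians with type 2 diabetes.
… (content omitted here) Ground truth label: Diabetes Mellitus Type 2 </blockquote>
Furthermore, on heterophilous nodes whose neighbors have labels different from their own, LLMs, like GNNs, are disturbed by the neighbor information and output wrong predictions.
<a href="https://imgse.com/i/pPp9N3F"></a>
<blockquote> Heterophily in GNNs is also an interesting research direction; see also our paper <a href="https://arxiv.org/abs/2306.01323">Demystifying Structural Disparity in Graph Neural Networks: Can One Size Fit All?</a> </blockquote>
<h3 id="案例研究利用llm生成标注">Case study: generating annotations with LLMs</h3>
As discussed above, LLMs can achieve good zero-shot prediction performance in some cases, which gives them the potential to replace humans in annotating samples. This paper takes a first step in exploring the possibility of generating annotations with LLMs and then training GNNs on them.
<a href="https://imgse.com/i/pPp9Uc4"></a>
Two points need to be studied for this problem:
<ul> <li>How to select important nodes in the graph based on structure and attributes so as to maximize the benefit of annotation, which is similar to the setting of active learning on graphs</li> <li>How to estimate the quality of the annotations generated by the LLM and filter out wrong ones</li> </ul>
<h2 id="讨论">Discussion</h2>
Finally, we briefly discuss the limitations of this paper and some interesting follow-up directions. First, this paper mainly targets node classification, and extending the pipeline to more graph learning tasks requires further study; from this perspective the title may be somewhat of an overclaim. In addition, there are scenarios where meaningful node attributes are unavailable. For example, in financial transaction networks user nodes are often anonymous, and how to construct meaningful prompts that LLMs can understand becomes a new challenge.
Second, how to reduce the cost of using LLMs is also worth considering. The enhancement discussed in the paper takes every node as input: with N nodes, N interactions with the LLM are needed, which is costly. In our experiments we also tried open-source models such as Vicuna, but the quality of the generated content is still far behind ChatGPT. Moreover, API calls to ChatGPT currently cannot be batched, so efficiency is low. How to reduce cost and improve efficiency while maintaining performance is a question worth thinking about.
Finally, an important issue is the evaluation of LLMs. The paper has discussed the possible test-set leakage and the unreasonableness of the single-label setting. For the first issue, a simple idea is to use data outside the pre-training corpora of large models, but this requires continuously updating datasets and producing correct human annotations. For the second issue, a possible solution is a multi-label setting. For paper classification datasets like arXiv, high-quality multi-label annotations can be generated from arXiv's own categories, but in more general cases, how to generate correct annotations remains difficult.
<h2 id="参考文献">References</h2>
[1] Zhao J, Qu M, Li C, et al. Learning on large-scale text-attributed graphs via variational inference[J]. arXiv preprint arXiv:2210.14709, 2022.
[2] Chien E, Chang W C, Hsieh C J, et al. Node feature extraction by self-supervised multi-scale neighborhood prediction[J]. arXiv preprint arXiv:2111.00064, 2021.
[3] He X, Bresson X, Laurent T, et al. Explanations as Features: LLM-Based Features for Text-Attributed Graphs[J]. arXiv preprint arXiv:2305.19523, 2023.
</article> <article> <h1>Graph Machine Learning</h1> 2023-07-15T00:00:00+00:00 This blog was written with the help of Yuanqi Du and Yanbang Wang.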
See full contents in <a href="https://ai4science101.github.io/blogs/graph_machine_learning/">AI4science blog 101</a> <h4 id="before-you-start">Before you start</h4> Graph is the fundamental data structure that denotes pairwise relationships between entities across various domains, e.g., web, gene, and molecule. Machine learning on graph, typical on Graph Neural Network, becomes more and more popular in recent years. In this blog, we will introduce some basic concepts of machine learning on graph. We hope it may give you inspiration on: <ul> <li> what is graph? why do we need graph? How to solve graph-related problems with machine learning techniques? </li> <li> How to correlate your specific task with the graph and view it as a graph problem? </li> <li> How to utilize existing techniques to solve your specific task? </li> </ul> Before going deep into the technical details, we first provide some motivations by introducing some histories on the developement of graph Neural Network (GNN). The history of GNN is emerged as a response to two significant challenges. The first challenge came from the data mining domain, where researchers were exploring ways to extend deep learning techniques to handle structured network data. Examples of such data include the World Wide Web, relational databases, and citation networks. The second challenge arose from the science domain, where researchers were attempting to apply deep learning techniques to practical science problems such as single-cell analysis, brain network analysis, and molecule property prediction. To meet these practical challenges, the GNN community has grown rapidly, with researchers collaborating across different fields beyond data mining. <h1 id="graph-type-graph-task-towards-graph-modeling">Graph type, Graph task, towards Graph modeling</h1> <h2 id="what-are-graphs-why-are-graphs-ubiquitous-in-science">What are graphs? 
Why are graphs ubiquitous in science?</h2> <h4 id="sec:whatgraph">What are graphs?</h4> The graph is a data formulation that is widely utilized to describe pairwise relations between nodes. Mathematically, a graph can be denoted as $\mathcal{G}=\left \{\mathcal{V}, \mathcal{E} \right \}$. $\mathcal{V}= \left \{v_1, v_2, \cdots, v_N \right \}$ is a set of $N=\left | \mathcal{V} \right |$ nodes. $\mathcal{E}= \left \{e_1, e_2, \cdots, e_M \right \}$ is a set of $M=\left | \mathcal{E} \right |$ edges, which describe the connections between nodes. $e=(v_1, v_2)$ indicates that an edge exists from node $v_1$ to node $v_2$. Moreover, nodes and edges can have corresponding features $X_V\in \mathbb{R}^{N\times d}$ and $X_E\in \mathbb{R}^{M\times d}$, respectively. <h4 id="why-are-graphs-ubiquitous-in-science">Why are graphs ubiquitous in science?</h4> The main advantage of the graph formulation is its universal representation ability. Universal means that a graph can be a natural representation for arbitrary data. In the data mining domain, much data can be naturally represented as a graph. Examples are shown in Figure 1: <ul> <li> A social network [1] can be represented as a graph. Each node represents one user. Each edge indicates that a relationship exists between two users, e.g., friendship or a domestic relationship. </li> <li> A transport network [2] can be represented as a graph. Each node represents one station. Each edge indicates that a route exists between two stations. </li> <li> A web network [3] can be represented as a graph. Each node represents one web page. Each edge indicates that a hyperlink exists between two pages. </li> </ul> <table> <tr> <td> (a) Social Network </td> <td> (b) Transport Network </td> <td> (c) Web Network </td> </tr> </table> <center>Figure 1: Examples for graph data in data mining domain</center> Moreover, the graph formulation also generalizes to other domains. In the computer vision domain, an image can be viewed as a grid graph. 
In the natural language processing domain, a sentence can be viewed as a path graph. In AI4Science, graphs can be adapted to a wide range of scientific problems. More concrete examples are shown in Figure 2: <ul> <li> A brain network [4] can be represented as a graph. Nodes represent brain regions, and edges represent connections between them. Connections can be structural, such as axonal projections, or functional, such as correlated activity between brain regions. Brain network graphs can be constructed at different scales, ranging from individual neurons and synapses to large-scale brain regions and networks. </li> <li> A gene-gene network [5] can be represented as a graph. Nodes represent genes, and edges represent interactions between them. These interactions can be based on different types of experimental evidence, such as co-expression, co-regulation, or protein-protein interactions. Gene-gene networks can be constructed at different levels of complexity, from small subnetworks involved in specific biological pathways to large-scale networks that span the entire genome. </li> <li> A molecule network [6] can be represented as a graph. Chemical compounds are denoted as graphs with atoms as nodes and chemical bonds as edges. Molecular networks can be constructed at different levels of complexity, from simple compounds such as water and carbon dioxide to complex biomolecules such as proteins and DNA. </li> </ul> <table> <tr> <td> (a) Gene-gene Network </td> <td> (b) Brain Network </td> <td> (c) Molecule Network </td> </tr> </table> <center>Figure 2: Examples for graph data in AI4Science domain</center> <h2 id="diverse-graph-formulations">Diverse Graph Formulations</h2> The simple graph mentioned in Section [1.1] is the most basic formulation of a graph, which considers only a single node type and a single edge type. However, real data may have additional characteristics that cannot be easily handled by the simple graph formulation. 
In this subsection, we briefly describe popular complex graphs, including the heterogeneous graph, bipartite graph, multidimensional graph, signed graph, hypergraph, and dynamic graph. <h4 id="bipartite-graph">Bipartite Graph</h4> The bipartite graph is a special simple graph in which edges can only exist between two node sets $\mathcal{V}_1$ and $\mathcal{V}_2$. The two node sets must satisfy: (1) they do not overlap: $\mathcal{V}_1 \cap \mathcal{V}_2 = \emptyset$; (2) together they contain all nodes: $\mathcal{V}_1 \cup \mathcal{V}_2 = \mathcal{V}$. The bipartite graph is utilized to describe the interactions between two different kinds of objects. It is typically utilized in e-commerce systems to describe the interactions between users and documents. It can also be utilized on different science problems. <h4 id="signed-graph">Signed Graph</h4> The signed graph is introduced to describe graphs with two edge types: positive edges and negative edges. A signed graph $\mathcal{G}$ consists of a set of nodes $\mathcal{V}=\{v_1, \cdots, v_N \}$ and a set of edges $\mathcal{E}=\{e_1, \cdots, e_M \}$. Additionally, there is an edge-type mapping function $\phi_e:\mathcal{E}\to\mathcal{T}_e$ that maps each edge to its type, where $\mathcal{T}_e = \left \{1, -1 \right \}$ is the set of edge types, positive or negative. It is typically utilized in social networks like Twitter, where a positive edge indicates following and a negative edge indicates blocking or unfollowing. It can also be utilized on different science problems. <h4 id="heterogeneous-graph">Heterogeneous Graph</h4> The heterogeneous graph introduces multiple node types into the graph. New relationship types also appear, since edges can connect nodes of different types. For example, a simple citation network can be represented with the simple graph formulation, where each node represents a paper and each edge indicates that one paper cites another. 
However, the citation network becomes more complex when we also consider: (1) Authors. Authors can have co-author relationships with each other, and authors write papers. (2) Paper types. Papers can have different types, e.g., Data Mining, Artificial Intelligence, Computer Vision, and Natural Language Processing. A heterogeneous graph $\mathcal{G}$ consists of a set of nodes $\mathcal{V}=\{v_1, \cdots, v_N \}$ and a set of edges $\mathcal{E}=\{e_1, \cdots, e_M \}$. Additionally, there are two mapping functions $\phi_n:\mathcal{V}\to\mathcal{T}_n$ and $\phi_e:\mathcal{E}\to\mathcal{T}_e$ that map each node and each edge to its type, respectively. $\mathcal{T}_n$ and $\mathcal{T}_e$ denote the sets of node types and edge types. <h4 id="multidimensional-graph">Multidimensional Graph</h4> The multidimensional graph is introduced to describe multiple relationships that exist simultaneously between a pair of nodes. It differs from the signed graph and the heterogeneous graph, neither of which allows multiple edges between a pair of nodes. A multidimensional graph consists of a set of $N$ nodes $\mathcal{V}= \{v_1, \cdots, v_N \}$ and $D$ sets of edges $\{\mathcal{E}_1, \cdots, \mathcal{E}_D \}$. Each edge set $\mathcal{E}_d$ describes the $d$-th type of relation between nodes. Different edge sets are allowed to intersect. It is typically utilized in social networks: users can "like", "retweet", and "comment" on a tweet, and each action corresponds to one relationship between the user and the tweet. It can also be utilized on different science problems. <h4 id="hypergraph">Hypergraph</h4> The hypergraph is introduced when relationships beyond a pair of nodes must be considered. A hypergraph $\mathcal{G}$ consists of a set of $N$ nodes $\mathcal{V}= \{v_1, \cdots, v_N \}$ and a set of hyperedges $\mathcal{E}$. The graph structure is described by the incidence matrix $\mathbf{H} \in \mathbb{R}^{|\mathcal{V}|\times |\mathcal{E}| }$ rather than the adjacency matrix $\mathbf{A}$: 
\[H_{i j} = \begin{cases} 1 & \text{if vertex } v_{i} \text{ is incident with edge } e_{j} \\ 0 & \text{otherwise.} \end{cases} \tag{1}\] It is typically utilized in academic networks, where papers are nodes and authors induce hyperedges: one author can publish more than one paper, which can be viewed as a hyperedge connecting all of those papers. <h4 id="dynamic-graphs">Dynamic Graphs</h4> The dynamic graph is introduced when the graph constantly evolves: new nodes and edges may be added, and some existing nodes and edges may disappear. A dynamic graph $\mathcal{G}$ consists of a set of $N$ nodes $\mathcal{V}= \{v_1, \cdots, v_N \}$ and a set of edges $\mathcal{E}$, where each node and edge is associated with a timestamp indicating the time it emerged. Two mapping functions $\phi_v$ and $\phi_e$ map each node and each edge to its timestamp, respectively. It is typically utilized in social networks, where nodes are users on Twitter. There are new users every day, and they can follow and unfollow other users from time to time. <h4 id="knowledge-graph">Knowledge Graph</h4> The knowledge graph is an important application in the graph domain. It is composed of nodes and edges, where nodes $\mathcal{V}$ represent entities (such as people, places, or objects) and edges $\mathcal{E}$ represent relationships $\mathcal{R}$ between these entities. These relationships can be diverse, including semantic relations (e.g., "is a" or "part of"), factual associations (e.g., "born in" or "works at"), or other contextual links. The graph-based structure allows for efficient querying and traversal of data, as well as the ability to infer new knowledge by leveraging existing connections. A knowledge graph is a structured representation of information that aims to model the relationships between entities, facts, and concepts in a comprehensive and interconnected way. 
It provides a flexible and efficient means of organizing, querying, and deriving insights from large volumes of data, making it a powerful tool for information retrieval and knowledge discovery. It is widely utilized in the Semantic Web, which enables machines to better understand and interact with web content by organizing information in a machine-readable format. Remark: In this subsection, we briefly introduced different graph formulations. However, real-world cases can be more complicated. For example, the network in e-commerce could be a heterogeneous bipartite multidimensional graph, corresponding to the following scenarios: (1) Heterogeneous: customers and purchasers could be different user types, and different items also have different types. (2) Bipartite: users can only interact with items. (3) Multidimensional: users can interact with items in different ways, e.g., "buy", "add to shopping cart", and so on. The graph formulations described in this subsection are best seen as prototypes. You can design a suitable graph formulation for your own data and then draw on recent progress for the corresponding graph types. <h2 id="what-are-typical-tasks-on-graph">What are typical tasks on graph?</h2> In this subsection, we provide a brief introduction to graph-related tasks to show how graphs can be utilized in different scenarios. Specifically, we introduce the node classification, graph classification, link prediction, and graph generation tasks. Most downstream tasks can be viewed as an instance of one of these tasks. <h4 id="node-classification">Node Classification</h4> Node classification aims to identify which class a node belongs to by utilizing the node's own (ego) features, the adjacency matrix, and the features of other nodes. The node classification task has numerous real-world applications. 
Examples are as follows: (1) Social network analysis: In social networks, nodes represent individuals and edges represent social relationships. Node classification can be utilized to predict various attributes, such as interests, affiliation, profession, and so on. (2) Bioinformatics: In biological networks, nodes represent genes, proteins, or other biological entities, and the connections between nodes represent interactions such as regulatory or metabolic relationships. Node classification can be utilized to predict various node properties, such as function, localization, or disease association. (3) Cybersecurity: In network security, nodes represent computers, servers, or other network devices, and the connections between nodes represent communication or access relationships. Node classification can be utilized to detect various types of network attacks or anomalies, such as malware, spam, or intrusion attempts. <h4 id="graph-classification">Graph Classification</h4> Graph classification aims to identify which class a whole graph belongs to by exploiting both the rich information in the graph structure and the node features. Image classification can be viewed as a special case of the graph classification task. Each pixel can be viewed as a node, with its RGB values as the node feature. The graph structure of an image is a grid that connects adjacent pixels. Graph classification has been broadly utilized in many real-world applications. Examples are as follows. (1) Bioinformatics: Graph classification can be utilized to classify biological networks into different categories. For example, we could classify a set of protein-protein interaction networks based on their function or disease association. It can help identify potential drug targets, protein complexes, or pathways, and inform drug discovery. (2) Chemistry: Graph classification can be utilized to classify chemical compounds into different categories. 
For example, we could classify a set of compounds based on their toxicity or therapeutic potential. (3) Social network analysis: Graph classification can be utilized to identify the discussion topic of a tweet on Twitter. <h4 id="link-prediction">Link Prediction</h4> Link prediction can be viewed as a binary classification task that predicts whether a link exists between two nodes in the graph. It can complete the graph and uncover under-discovered relationships between nodes. Link prediction has been broadly utilized in many domains. Examples are as follows: (1) Friend recommendation in social networks: Twitter can recommend people you may know or be interested in. (2) Movie recommendation: Netflix recommends films you may be interested in. (3) Bioinformatics: In biological networks, link prediction can be utilized to predict the likelihood of physical interactions between pairs of proteins based on their sequence similarity, domain composition, or other features. It can help identify potential drug targets, protein complexes, or pathways, and inform drug discovery. <h4 id="graph-generation">Graph Generation</h4> In contrast to the aforementioned tasks, graph generation aims to solve a generative problem: given a dataset of graphs, learn to sample new graphs from the learned data distribution. Since graphs can represent many kinds of highly structured data, graph generation holds promise for design tasks in a variety of domains, such as molecular graph generation (drug & materials discovery), circuit network design, indoor layout design, etc. <h1 id="how-to-model-graph-structured-data">How to Model Graph Structured Data?</h1> In this section, we aim to introduce (1) Graph Neural Networks, which have become popular for learning graph representations by jointly leveraging attribute and graph structure information; 
(2) perspectives for understanding GNNs, which connect GNN design to other domains, e.g., graph signal processing, the Weisfeiler-Lehman isomorphism test, and so on; and (3) traditional graph machine learning methods and structure-agnostic methods, which may perform even better than GNNs. <h2 id="graph-neural-network">Graph Neural Network</h2> The design of the Graph Neural Network is inspired by the Convolutional Neural Network (CNN), one of the most widely used neural networks in the computer vision domain. A CNN utilizes neighboring pixels to learn a good representation. Concretely, CNNs extract different feature patterns by aggregating the neighboring pixels in a fixed-size receptive field, for example, a receptive field of $3\times 3$ neighboring pixels. To extend the strengths of CNNs to graphs, researchers developed the Graph Neural Network. There are two essential problems in developing Graph Neural Networks. <ul> <li> How to define the receptive field on a graph, since it is not a regular grid? </li> <li> What feature patterns are useful on the graph? </li> </ul> Those two questions lead to two crucial perspectives on designing Graph Neural Networks, the spatial and spectral perspectives, respectively. Before going into the details of these perspectives, we first provide a definition of the general Graph Neural Network framework. <h4 id="a-general-framework-for-graph-neural-network">A General Framework for Graph Neural Network</h4> We introduce the general framework of GNNs for the most basic node-level task. We first recap some notation. We denote a graph as $\mathcal{G}= \left \{ \mathcal{V}, \mathcal{E} \right \}$ (e.g., a molecule). The adjacency matrix and the associated features are denoted as $\mathbf{A}\in \mathbb{R}^{N \times N}$ (e.g., bond types) and $\mathbf{F}\in \mathbb{R}^{N \times d}$ (e.g., atom types), respectively. $N$ and $d$ are the number of nodes and the feature dimension, respectively. 
A general framework for Graph Neural Networks can be regarded as a composition of $L$ graph filter layers and $L-1$ nonlinear activation layers. $h_i$ and $\alpha_i$ denote the $i$-th graph filter layer and activation layer, respectively. $\mathbf{F}_i \in \mathbb{R}^{N\times d_i}$ denotes the output of the $i$-th graph filter layer $h_i$. $\mathbf{F}_0$ is initialized to be the raw node features $\mathbf{F}$. <h4 id="spatial-graph-filter-how-to-define-the-receptive-field">Spatial Graph filter: How to define the receptive field?</h4> For an image with a regular grid structure, the receptive field is defined as the neighboring pixels around the central pixel. An example is illustrated in Fig. . So how can the receptive field be defined on a graph with no unified regular structure? The answer is the neighboring nodes reached along the edges. The one-hop neighborhood of node $v_i$ can be defined as $\mathcal{N}(v_i) = \left \{ v_j \mid (v_i, v_j) \in \mathcal{E}\right \}$. To adaptively extract the neighborhood information, a large variety of spatial-based graph filters have been proposed. We introduce two typical spatial graph filter layers, GraphSAGE and GAT, in this section. GraphSAGE [7]: The GraphSAGE model introduced a spatial-based filter that aggregates information from neighboring nodes. In the steps below, $\mathbf{F}_j$ denotes the feature vector of node $v_j$. The hidden feature for node $v_i$ is generated with the following steps. <ul> <li> Sample neighborhood nodes from the neighborhood set. $\mathcal{N}_S(v_i)=\text{SAMPLE}(\mathcal{N}(v_i), S)$ where $\text{SAMPLE}()$ is a function that takes the neighborhood set as input and randomly samples $S$ instances as the output. </li> <li> Extract the information from neighborhood nodes. $\mathbf{f}'_i = \text{AGGREGATE}( \left \{ \mathbf{F}_j, \forall v_j \in \mathcal{N}_S(v_i) \right \} )$ where $\text{AGGREGATE}: \mathbb{R}^{S\times d} \to \mathbb{R}^{d}$ is a function that combines the information from the neighboring nodes. 
</li> <li> Combine the neighborhood information with the ego information. $\mathbf{F}'_i=\sigma \left ( [\mathbf{F}_i, \mathbf{f}'_i] \mathbf{\Theta} \right )$ where $[\cdot, \cdot]$ is the concatenation operation and $\mathbf{\Theta}$ denotes the learnable parameters. </li> </ul> The aggregation can be any set function. Typical aggregators include the mean and maximum aggregators, which take the element-wise mean and maximum, respectively. A sum aggregator with stronger expressive ability was introduced later. GAT [8]: The Graph Attention Network (GAT) is inspired by the self-attention mechanism. GAT adaptively aggregates the neighborhood information based on attention scores. The hidden feature for node $v_i$ is generated with the following steps. <ul> <li> Generate the attention score with each neighborhood node. $e_{ij} = a(\mathbf{F}_i\mathbf{\Theta}, \mathbf{F}_j\mathbf{\Theta})=\text{LeakyReLU} (\mathbf{a}^T \left [ \mathbf{F}_i\mathbf{\Theta}, \mathbf{F}_j\mathbf{\Theta} \right ]) \text{ s.t. } v_j \in \mathcal{N}(v_i) \cup \left \{ v_i \right \}$ where $\mathbf{a}$ is a learnable weight vector. </li> <li> Normalize the attention scores via softmax. $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{v_k \in \mathcal{N}(v_i) \cup \left \{ v_i \right \}} \exp(e_{ik})}$ </li> <li> Aggregate the weighted information from the neighborhood. $\mathbf{F}'_i = \sum_{v_j \in \mathcal{N}(v_i) \cup \left \{ v_i \right \}} \alpha_{ij}\mathbf{F}_j\mathbf{\Theta}$ </li> <li> Multi-head attention implementation. $\mathbf{F}'_i = ||_{m=1}^M \sum_{v_j \in \mathcal{N}(v_i) \cup \left \{ v_i \right \}} \alpha_{ij}^m \mathbf{F}_j \mathbf{\Theta}^m$ where $||$ is the concatenation operator and $M$ is the number of heads. </li> </ul> Notice that the key difference between GAT and the self-attention mechanism is that self-attention is conducted over all nodes, whereas GAT is conducted only over the neighborhood nodes. More discussion can be found in the next section. 
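The GAT steps above can be sketched for a single attention head in plain Python. This is only a minimal illustration, not a reference implementation: the toy graph, the feature values, the attention vector `a`, and the helper names `leaky_relu` and `gat_layer` are assumptions made for the example, and $\mathbf{\Theta}$ is taken to be the identity so that the projection step can be omitted.

```python
import math


def leaky_relu(x, slope=0.2):
    """LeakyReLU with a hypothetical negative slope of 0.2."""
    return x if x > 0 else slope * x


def gat_layer(features, neighbors, a):
    """One single-head GAT aggregation with Theta = identity (assumption).

    features:  list of per-node feature vectors (lists of floats)
    neighbors: dict mapping a node to its neighbor list (without itself)
    a:         attention vector of length 2 * d
    Returns (new_features, attention) with attention[i][j] = alpha_ij.
    """
    new_features, attention = [], []
    for i, f_i in enumerate(features):
        hood = neighbors[i] + [i]  # N(v_i) together with v_i itself
        # Step 1: scores e_ij = LeakyReLU(a^T [F_i, F_j]).
        scores = [
            leaky_relu(sum(w * x for w, x in zip(a, f_i + features[j])))
            for j in hood
        ]
        # Step 2: softmax over the neighborhood (shifted for stability).
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha = {j: e / total for j, e in zip(hood, exps)}
        # Step 3: weighted sum of the neighbors' features F_j.
        out = [
            sum(alpha[j] * features[j][k] for j in hood)
            for k in range(len(f_i))
        ]
        new_features.append(out)
        attention.append(alpha)
    return new_features, attention


# Toy path graph 0 - 1 - 2 with 2-dimensional features (illustrative values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
a = [0.5, -0.5, 1.0, 0.25]

new_features, attention = gat_layer(features, neighbors, a)
for node, alpha in enumerate(attention):
    print(node, {j: round(w, 3) for j, w in alpha.items()})
```

Because the softmax runs only over $\mathcal{N}(v_i) \cup \{v_i\}$, node 0 never attends to node 2 in this toy graph; running the same function with every node listed as a neighbor of every other node would give ordinary self-attention over all nodes.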
<h4 id="spectral-graph-filter-what-feature-patterns-are-useful-on-the-graph">Spectral Graph filter: What feature patterns are useful on the graph?</h4> Spectral-based graph filters mainly utilize spectral graph theory to develop the filter operation in the spectral domain. We only provide some motivation for spectral-based graph filters, without mathematical details. The motivation behind spectral graph filters is that neighboring nodes in a graph should have similar representations. In the context of spectral graph theory and filters, neighborhood similarity corresponds to the low-frequency components, i.e., signals that change slowly or gradually across the graph structure. In contrast, high-frequency components correspond to rapid or abrupt changes. By focusing on the low-frequency components, spectral graph filters can capture the underlying smooth variations over the graph topology, which can be useful for various tasks, e.g., node classification, link prediction, and graph clustering. In other words, spectral graph filters aim to identify feature patterns that are smooth and do not vary significantly between connected nodes, which corresponds to the low-frequency components of the graph structure in spectral graph theory. GCN [9]: We only provide a brief introduction to the formulation of the Graph Convolutional Network (GCN). A more comprehensive study can be found in Section 5.3.1 of <a href="https://web.njit.edu/~ym329/dlg_book/index.html">Deep Learning on Graphs</a> [10]. The aggregation function of GCN is defined as: $\mathbf{F}'= \sigma( \tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}}\mathbf{F}\mathbf{\Theta}) \tag{2}$ where $\sigma$ is the activation function, $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ is the adjacency matrix with self-loops, $\tilde{\mathbf{D}}$ is its degree matrix, and $\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}}$ is the symmetrically normalized adjacency matrix. 
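As a hedged illustration of Eq. (2), the sketch below builds the normalized matrix for a toy graph in plain Python and applies one propagation step. The graph, the feature values, and the function names are assumptions for the example. The weight matrix $\mathbf{\Theta}$ and the activation $\sigma$ are left out (treated as identity) so that only the normalization is visible, and self-loops are added following the usual GCN convention $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$.

```python
import math


def normalized_adjacency(adj):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a dense 0/1 adjacency matrix."""
    n = len(adj)
    # Add self-loops: A~ = A + I.
    a_tilde = [
        [adj[i][j] + (1 if i == j else 0) for j in range(n)]
        for i in range(n)
    ]
    # Degrees in A~ (row sums), always >= 1 because of the self-loop.
    deg = [sum(row) for row in a_tilde]
    return [
        [a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]


def propagate(norm_adj, features):
    """Multiply the normalized adjacency with the feature matrix."""
    n, d = len(features), len(features[0])
    return [
        [sum(norm_adj[i][j] * features[j][k] for j in range(n)) for k in range(d)]
        for i in range(n)
    ]


# Toy path graph 0 - 1 - 2 (illustrative values).
adj = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

norm_adj = normalized_adjacency(adj)
smoothed = propagate(norm_adj, features)
for row in smoothed:
    print([round(v, 3) for v in row])
```

Entry $(i, j)$ of the normalized matrix equals $1/\sqrt{\tilde{d}_i \tilde{d}_j}$ whenever $v_j$ is a neighbor of $v_i$ or $j = i$, and zero otherwise, so each output row is a degree-weighted average over the node and its neighbors.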
The aggregation function for each node can be defined as: $\mathbf{F}'_i = \sum_{v_j \in \mathcal{N}(v_i) \cup \left \{ v_i \right \}} \frac{1}{\sqrt{\tilde{d}_i\tilde{d}_j}} \mathbf{F}_j\mathbf{\Theta}\tag{3}$ where $\tilde{d}_i$ is the degree of node $i$ in $\tilde{\mathbf{A}}$. <h4 id="message-passing-neural-network">Message Passing Neural Network</h4> The above discussion focused on GNN design for the simple graph with a single node type and edge type. The Message Passing Neural Network (MPNN) was then proposed as a more general framework that can cover the entire design space of GNNs. Concretely, MPNNs are a family of neural networks that operate on graphs by (1) generating messages between nodes based on their local neighborhoods and (2) iteratively aggregating messages from neighboring nodes. In this way, MPNNs can learn powerful graph representations for various downstream tasks. The message passing filter can be defined as: $h_{i}^{\ell+1}=\phi\left(h_{i}^{\ell}, \oplus_{j \in \mathcal{N}_{i}}\left(\psi\left(h_{i}^{\ell}, h_{j}^{\ell}, e_{i j}\right)\right)\right)\tag{4}$ where $\phi$, $\psi$ are Multi-Layer Perceptrons (MLPs), and $\oplus$ is a permutation-invariant local neighborhood aggregation function such as summation, maximization, or averaging. Focusing on one particular node $i$, the MPNN layer can be decomposed into three steps: <ul> <li> Message: For each pair of linked nodes $i$, $j$, the network first computes a message $m_{i j}=\psi\left(h_{i}^{\ell}, h_{j}^{\ell}, e_{i j}\right)$. The MLP $\psi: \mathbb{R}^{2d+d_e}\to \mathbb{R}^{d}$ takes as input the concatenation of the feature vectors of the source node, the target node, and the edge. 
</li> <li> Aggregate: At each node $i$, the incoming messages from all its neighbors $j \in \mathcal{N}_i$ are then aggregated as $m_{i}=\oplus_{j \in \mathcal{N}_{i}}\left(m_{i j}\right)$ </li> <li> Update: Finally, the network updates the node feature vector $h_{i}^{\ell+1}=\phi\left(h_{i}^{\ell}, m_{i}\right)$ by concatenating the aggregated message $m_i$ and the previous node feature vector $h_i^{\ell}$, and passing them through an MLP $\phi: \mathbb{R}^{2 d} \rightarrow \mathbb{R}^{d}$. </li> </ul> <h4 id="permutation-equivarainceinvariance">Permutation Equivariance/Invariance</h4> A function $f$ is said to be equivariant if for any transformation $\tau$ of the input space $X$ and any input $x\in X$, we have $f(\tau(x)) = \tau(f(x))$. In other words, applying the transformation $\tau$ to the input has the same effect as applying it to the output. A function $f$ is said to be invariant if for any transformation $\tau$ of the input space $X$ and any input $x\in X$, we have $f(\tau(x)) = f(x)$. In other words, applying the transformation $\tau$ to the input does not change the output. In the context of GNNs, we want to achieve permutation equivariance or permutation invariance, which means that the function should be equivariant or invariant to permutations of the nodes of the input graph. We can express this mathematically by defining a permutation $\sigma$ of the nodes of the input graph $G=(V,E)$, where $\sigma(G)$ is the graph obtained by applying the permutation $\sigma$ to the nodes of $G$. For graph-level outputs, invariance requires that the output of the GNN is the same regardless of the permutation: $f(G) = f(\sigma(G))$. For node-level outputs, equivariance requires that the outputs are permuted in the same way as the nodes: $f(\sigma(G)) = \sigma(f(G))$. <h2 id="understanding-perspectives-on-gnn">Understanding perspectives on GNN</h2> <center>Figure 3: An overview of the connection between WL-test and Graph Neural Network. Middle panel: rooted subtree structures (at the blue node) that the WL test uses to distinguish different graphs. 
Right panel: if a GNN’s aggregation function captures the full multiset of node neighbors, the GNN can capture the rooted subtrees in a recursive manner and be as powerful as the WL test[11]</center> <h4 id="gnn-expressiveness-and-weisfeiler-lehman-isomorphism-test">GNN Expressiveness and Weisfeiler-Lehman Isomorphism Test.</h4> The expressiveness of Graph Neural Networks is closely related to the <a href="https://www.davidbieber.com/post/2019-05-10-weisfeiler-lehman-isomorphism-test/">graph isomorphism test</a>. An expressive GNN should map isomorphic graphs to the same representation and distinguish non-isomorphic graphs with different representations. The Weisfeiler-Lehman (WL) test is a popular graph isomorphism test used to check whether two graphs are isomorphic, meaning that the two graphs have the same underlying structure but may differ in their node labeling. The intuition behind the WL test is that if two graphs are isomorphic, then their structures should match across all hops of neighborhoods, from one-hop neighborhoods to the global structure of the entire graph. The algorithm iterates over the following two steps: (1) aggregation: collect the multiset of neighbor node labels; (2) labeling: assign a new label based on the node's current label and the label multiset of its neighbors. The WL test repeats this aggregation and labeling process until convergence (the node labels no longer change). If the resulting sequences of refined graphs differ, the two graphs are certainly non-isomorphic; if they agree, the test cannot distinguish the two graphs, and they are possibly isomorphic. The WL test is widely utilized in different domains since it is efficient, with time complexity $O(n \log (n))$, where $n$ is the number of nodes. More recently, the WL test has been widely utilized for analyzing the expressiveness of GNNs. <h4 id="gnn-and-transformers">GNN and Transformers.</h4> Graph Neural Networks and Transformers are two popular model architectures for leveraging context information. 
<a href="https://graphdeeplearning.github.io/post/transformers-are-gnns/">Connections</a> can be found between those two architectures. The self-attention update of a Transformer layer can be written as $\begin{array}{c} h_{i}^{\ell+1}=\operatorname{Attention}\left(Q^{\ell} h_{i}^{\ell}, K^{\ell} h_{j}^{\ell}, V^{\ell} h_{j}^{\ell}\right), \\ \text{i.e., } h_{i}^{\ell+1}=\sum_{j \in \mathcal{S}} w_{i j}\left(V^{\ell} h_{j}^{\ell}\right), \end{array}$ where $w_{i j}=\operatorname{softmax}_{j}\left(Q^{\ell} h_{i}^{\ell} \cdot K^{\ell} h_{j}^{\ell}\right)$. Here $j\in \mathcal{S}$ ranges over the words in the sentence $\mathcal{S}$, and $Q^{\ell}, K^{\ell}, V^{\ell}$ are learnable linear weights that produce the Query, Key, and Value for the attention, respectively. One update of each word embedding can be viewed as a weighted aggregation of all word embeddings in the sentence. An illustration of the self-attention block in the Transformer is shown in Fig. 4(b). One Graph Neural Network block can be defined as follows: $h_{i}^{\ell+1}=\sigma\left(U^{\ell} h_{i}^{\ell}+\sum_{j \in \mathcal{N}(i)}\left(V^{\ell} h_{j}^{\ell}\right)\right),\tag{5}$ where $U^{\ell}, V^{\ell}$ are learnable transformation matrices of the GNN layer and $\sigma$ is the nonlinear activation function. One update of the hidden representation $h_i$ of node $i$ at layer $\ell$ can be viewed as an aggregation of the representations of its neighborhood nodes $j\in \mathcal{N}(i)$. An illustration of the GNN block is shown in Fig. 4(a). <table> <tr> <td> (a) GNN block </td> <td> (b) Transformer block </td> </tr> </table> <center>Figure 4. GNN vs Transformer </center> The key difference between the Graph Neural Network and the Transformer is that the Graph Neural Network only aggregates over the neighborhood nodes, while the Transformer aggregates over all the words in the sentence. Put differently, the Transformer can be viewed as a GNN that aggregates on a fully-connected word graph. 
Both the Graph Neural Network and the Transformer aim to learn good representations by incorporating context information. The Transformer treats all the words in a sentence as useful context, while the GNN treats only the neighborhood nodes as useful context. <h4 id="gnn-and-graph-signal-denoising-processes">GNN and Graph Signal Denoising Processes.</h4> Graph signal denoising [12] offers a new perspective for a unified understanding of representative aggregation operations. Graph signal denoising aims to recover a clean signal from the original noisy signal. It can be defined as solving the following optimization problem: $\arg \min_F \mathcal{L}=||F-S||_F^2 + c \cdot \text{tr}(F^TLF) \tag{6}$ where $S\in \mathbb{R}^{N\times d}$ is a noisy signal (input feature) on graph $\mathcal{G}$, and $F\in \mathbb{R}^{N\times d}$ is the clean signal assumed to be smooth over $\mathcal{G}$. The first term guides $F$ to be close to $S$, while the second term $\text{tr}(F^TLF)$ is the Laplacian regularization, which guides $F$ to be smooth over $\mathcal{G}$; the coefficient $c > 0$ controls the trade-off between the two terms. Assuming that we adopt the unnormalized Laplacian matrix $L = D - A$ (the adjacency matrix $A$ is assumed to be binary), the second term can be written in an edge-centric way as: $c \sum_{(i,j)\in \mathcal{E}} ||F_i-F_j||_2^2\tag{7}$ which encourages connected nodes to share similar features. We show the connection between graph signal denoising and GCN as an example here. The gradient with respect to $F$ at $F = S$ is $\left. \frac{\partial \mathcal{L}}{\partial F} \right|_{F = S} = 2cLS$. Hence, one step of gradient descent on the graph signal denoising problem in Eq. (6), starting from $F = S$ with step size $b$, can be written as: \[\begin{aligned} F \leftarrow S - b\left. \frac{\partial \mathcal{L}}{\partial F} \right|_{F = S} &= S - 2bcLS \\ &= (1-2bc )S+ 2bc\tilde{A}S, \end{aligned}\] where the second equality uses the normalized Laplacian $L = I - \tilde{A}$. When the step size is $b=\frac{1}{2c}$ and ${ S}={ X}'$, we have $F \leftarrow \tilde{A}X'$, which is the same as the aggregation operation of GCN. 
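The one-step claim can be checked numerically. The sketch below is an illustration under stated assumptions rather than part of the original derivation: it uses the normalized Laplacian $L = I - \tilde{A}$ with $\tilde{A} = \tilde{D}^{-\frac{1}{2}}(A + I)\tilde{D}^{-\frac{1}{2}}$, a toy graph, arbitrary signal values, and hypothetical helper names, and it verifies that a gradient step with $b = \frac{1}{2c}$ starting from $F = S$ returns $\tilde{A}S$.

```python
import math


def matmul(a, b):
    """Dense matrix product for lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def normalized_adjacency(adj):
    """D~^{-1/2} (A + I) D~^{-1/2}, the normalization assumed in this sketch."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [
        [a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]


def denoising_gradient(f, s, lap, c):
    """Gradient of ||F - S||_F^2 + c * tr(F^T L F) for a symmetric L."""
    lap_f = matmul(lap, f)
    return [
        [2 * (f[i][k] - s[i][k]) + 2 * c * lap_f[i][k] for k in range(len(f[0]))]
        for i in range(len(f))
    ]


# Toy path graph 0 - 1 - 2 and a noisy signal S (illustrative values).
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
signal = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
c = 0.7
b = 1 / (2 * c)

a_norm = normalized_adjacency(adj)
n = len(adj)
laplacian = [[(1 if i == j else 0) - a_norm[i][j] for j in range(n)] for i in range(n)]

# One gradient-descent step starting from F = S.
grad = denoising_gradient(signal, signal, laplacian, c)
one_step = [
    [signal[i][k] - b * grad[i][k] for k in range(len(signal[0]))]
    for i in range(n)
]

# GCN-style aggregation (without Theta and the activation).
gcn_aggregation = matmul(a_norm, signal)
max_gap = max(
    abs(one_step[i][k] - gcn_aggregation[i][k])
    for i in range(n)
    for k in range(len(signal[0]))
)
print(max_gap)
```

Choosing a smaller step size $b < \frac{1}{2c}$ keeps a share $(1 - 2bc)$ of the original signal, which makes the trade-off between feature preservation and neighborhood smoothing explicit.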
It provides a new perspective to understand existing GNNs as a tradeoff between preserving the original features and smoothing over the neighborhood. Moreover, it inspires us to derive new Graph Neural Networks from different graph signal processing methods. <h4 id="gnn-and-gradient-flow">GNN and Gradient Flow.</h4> A new <a href="https://towardsdatascience.com/graph-neural-networks-as-gradient-flows-4dae41fb2e8a">physics-inspired perspective</a> is to understand a Graph Neural Network as a discrete dynamical system of particles [13]. Each node on the graph corresponds to one particle, while each edge represents a pair-wise interaction between nodes. Positive and negative interactions between nodes can be interpreted as attraction and repulsion between particles, respectively. To view a Graph Neural Network as a discrete dynamical system, one can regard the layer-by-layer forward pass as the evolution of the input under a system of differential equations. Each discrete time step of the dynamical system corresponds to the forward pass through one layer. Gradient flow is a special type of evolution equation of the form $\dot{X}(t)=f(X(t))=- \nabla \mathcal{E}(X(t))\tag{9}$ where $\mathcal{E}$ is an energy functional, which could be different for different GNNs. The gradient flow makes $\mathcal{E}$ monotonically decrease during the evolution. A simple GNN can be viewed as the gradient flow of the Dirichlet energy $\mathcal{E}^{\text{DIR}}(X) =\frac{1}{2} \text{trace}(X^TLX)\tag{10}$ The Dirichlet energy measures the smoothness of the features on the graph. In the limit $t\to \infty$, the node features become so smooth that all nodes share the same representation, which means the system loses the information contained in the input features. This phenomenon is called ‘oversmoothing’ in the GNN literature. To design better Graph Neural Networks that overcome drawbacks like oversmoothing, we can parametrise an energy and derive a GNN as its discretised gradient flow.
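The oversmoothing behaviour of the Dirichlet gradient flow can be seen in a small simulation. This is a rough sketch under stated assumptions: the unnormalized Laplacian $L = D - A$, one scalar feature per node, and an explicit Euler discretisation with a hypothetical step size `tau`; the function names are illustrative only.

```python
def laplacian(A):
    # Unnormalized graph Laplacian L = D - A.
    n = len(A)
    deg = [sum(row) for row in A]
    return [[(deg[i] if i == j else 0.0) - A[i][j] for j in range(n)]
            for i in range(n)]

def dirichlet_energy(L, x):
    """E(x) = 1/2 * x^T L x for one scalar feature per node."""
    n = len(x)
    return 0.5 * sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))

def gradient_flow_step(L, x, tau):
    """Explicit Euler step of dx/dt = -grad E(x) = -L x (one 'layer')."""
    n = len(x)
    return [x[i] - tau * sum(L[i][j] * x[j] for j in range(n))
            for i in range(n)]

# Toy 4-node path graph with distinct initial features.
A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
L = laplacian(A)
x = [1.0, 0.0, 0.0, -1.0]

energies = [dirichlet_energy(L, x)]
for _ in range(200):
    x = gradient_flow_step(L, x, tau=0.1)
    energies.append(dirichlet_energy(L, x))
```

On this toy graph the recorded energies never increase, and after many steps all entries of `x` collapse to nearly the same value, which is the oversmoothing effect described above.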
This approach offers better interpretability and leads to more effective architectures. <h4 id="gnn-and-dynamic-programming">GNN and Dynamic Programming</h4> Dynamic programming on graphs is a technique that solves problems by breaking them down into smaller subproblems and finding optimal solutions to those subproblems. This approach can be used to solve a wide range of problems on graphs, including shortest path problems, maximum flow problems, and minimum spanning tree problems. It shares a similar idea with the aggregation operation in GNNs, which recursively combines information from neighboring nodes to update the representation of a given node. In dynamic programming, the combination of information is typically done by recursively solving subproblems and building up a solution to a larger problem. Similarly, in GNN aggregation, neighboring node information is combined through various aggregation functions (e.g. mean, max, sum), and the updated node representation is then passed to subsequent layers in the network. In both cases, the goal is to efficiently compute a global solution by leveraging local information from neighboring nodes. However, vanilla GNNs cannot solve most dynamic programming problems, e.g., the shortest path algorithm and the generalized Bellman-Ford algorithm, without capturing the underlying logic and structure of the corresponding problem. To empower GNNs with the reasoning ability of dynamic programming, multiple operators have been proposed to generalize the operations in dynamic programming to neural networks, e.g., the sum generalizes to a commutative summation operator $\oplus$, and the product generalizes to an operator $\otimes$ such as the Hadamard product. GNNs can then be extended with different dynamic programming algorithms, with improved generalization ability.
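To make the analogy concrete, the classical Bellman-Ford algorithm can be written so that each relaxation round looks like one message-passing layer: the message along an edge is the sender's distance plus the edge weight, and the aggregation is a minimum. The sketch below is plain Python with hypothetical names (`bellman_ford_layer`, `shortest_paths`) and a toy graph; it is the hand-written algorithm, not a learned GNN.

```python
INF = float("inf")

def bellman_ford_layer(dist, edges):
    """One 'layer': every node v aggregates messages dist[u] + w with min.

    edges: list of directed edges (u, v, w) with weight w.
    Messages are computed from the previous layer's distances only.
    """
    new_dist = list(dist)
    for u, v, w in edges:
        new_dist[v] = min(new_dist[v], dist[u] + w)
    return new_dist

def shortest_paths(num_nodes, edges, source):
    # Initial node states: 0 at the source, infinity elsewhere.
    dist = [INF] * num_nodes
    dist[source] = 0.0
    # num_nodes - 1 layers suffice when there are no negative cycles.
    for _ in range(num_nodes - 1):
        dist = bellman_ford_layer(dist, edges)
    return dist

edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0)]
dist = shortest_paths(4, edges, source=0)
```

Swapping the `min` for a sum or mean, and the `dist[u] + w` message for a learned transformation, recovers the usual GNN layer, which is why the two computation structures align so well.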
A simple example of extending a Graph Neural Network to the Bellman-Ford algorithm can be found in Figure 5. <center>Figure 5: The framework [14] suggests that better algorithmic alignment improves generalization. The computation structure of the GNN (left) aligns well with the Bellman-Ford algorithm (right). GNN can simulate Bellman-Ford by merely learning a simple reasoning step.</center> <h2 id="methods-before-graph-neural-network">Methods before Graph Neural Network</h2> Graph Neural Networks are well recognized as a powerful method for machine learning on graphs. However, GNNs are still not the dominant method in the graph domain. Traditional graph machine learning methods and non-graph methods still show advantages over Graph Neural Networks. They hold an important position in graph research and inspire the design of new Graph Neural Networks. In this section, we introduce some important machine learning methods beyond GNNs, including label propagation for node classification, graph kernel methods for graph classification, and heuristic methods for link prediction. <h4 id="label-propagation-for-node-classification">Label Propagation for node classification</h4> Label propagation is a simple but effective method for node classification in graphs. It is a semi-supervised learning technique that leverages the idea that nodes that are connected in a graph are likely to share the same label or class. For example, it can be applied to a network of people with two labels, "interested in cricket" and "not interested in cricket". We only know the interests of a few people and we aim to predict the interests of the remaining unlabeled nodes. The procedure of label propagation is as follows. Let $A$ be the $n \times n$ adjacency matrix of the graph, where $A_{ij}$ is 1 if there is an edge between nodes $i$ and $j$, and 0 otherwise.
Let $Y$ be the $n \times c$ matrix of node labels, where $Y_{ij}$ is 1 if node $i$ belongs to class $j$, and 0 otherwise. Let $F^{(t)}$ be the $n \times c$ matrix of label distributions, where $F_{ij}^{(t)}$ is the probability of node $i$ belonging to class $j$ at iteration $t$. At each iteration $t$, the label distribution $F^{(t)}$ is updated based on the label distributions of the neighboring nodes as follows: $F^{(t)}=D^{-1}AF^{(t-1)}\tag{11}$ where $D$ is the diagonal degree matrix of the graph with $D_{ii} = \sum_j A_{ij}$. This update is repeated for a fixed number of iterations or until the label distributions converge, after which each node is assigned the label with the highest probability: \[Y_i = \arg\max_j F^{(t)}_{ij}\tag{12}\] Starting from the known labels, $t$ propagation steps can be written in closed form as $\hat{\mathbf{Y}}=(\mathbf{D}^{-1}\mathbf{A})^t\mathbf{Y}\tag{13}$ where $\mathbf{D}$ and $\mathbf{A}$ are the degree matrix and adjacency matrix, respectively, $t$ is the number of propagation steps, $\mathbf{Y}=\begin{bmatrix} \mathbf{Y}_l \\ \mathbf{0} \end{bmatrix}$ is the label matrix with the known labels $\mathbf{Y}_l$ on top and zeros for the unlabeled nodes, and $\mathbf{D}^{-1}\mathbf{A}$ is the transition matrix. <h4 id="graph-kernel-methods-for-graph-classification">Graph kernel methods for graph classification</h4> Graph kernel methods measure the similarity between two graphs with a kernel function, which corresponds to an inner product in a reproducing kernel Hilbert space (RKHS). Kernel methods are widely used in Support Vector Machines. They allow us to model higher-order features of the original feature space without computing the coordinates of the data in a higher-dimensional space. Compared with general kernel methods, graph kernel methods face the additional challenge of encoding similarity between graph structures. The design of graph kernel methods focuses on finding suitable graph patterns to measure similarity.
We will briefly introduce graph kernels based on subgraph patterns and on path patterns. Graph kernels based on subgraphs aim to find the subgraphs shared between graphs: the more subgraphs two graphs share, the more similar they are. The subgraph set can be defined by graphlets, which are induced, non-isomorphic subgraphs with $k$ nodes. An illustration can be found in Figure 6. A pattern count vector $\mathbf{f}$ is calculated, where the $i^{\text{th}}$ component denotes the frequency with which subgraph pattern $i$ occurs. The graph kernel can then be defined as: $\mathcal{K}_{\text{GK}}(\mathcal{G}, \mathcal{G}')= \left \langle \mathbf{f}^{\mathcal{G}}, \mathbf{f}^{\mathcal{G}'} \right \rangle\tag{14}$ where $\mathcal{G}$ and $\mathcal{G}'$ are two graphs and $\left \langle \cdot, \cdot \right \rangle$ denotes the Euclidean dot product. <center>Figure 6: Connected, non-isomorphic induced sub-graph of node size $k \le 5$</center> Graph kernels based on paths decompose a graph into paths. They use the co-occurrence of random walks on two graphs to calculate the similarity. Unlike subgraph-based methods, which focus on the graph structure, path-based methods take the node labels in the graph into consideration. A shortest-path variant counts all shortest paths in graph $\mathcal{G}$, denoted as triplets $p_i=(l_s^i, l_e^i, n_k )$, where $n_k$ is the length of the path and $l_s^i$ and $l_e^i$ are the labels of the starting and ending vertices, respectively. Similarly, the graph kernel can be defined as: $\mathcal{K}_{\text{GK}}(\mathcal{G}, \mathcal{G}')= \left \langle \mathbf{f}^{\mathcal{G}}, \mathbf{f}^{\mathcal{G}'} \right \rangle\tag{15}$ where the $i^{\text{th}}$ component of $\mathbf{f}$ denotes the frequency with which triplet $i$ occurs. <h4 id="heuristic-methods-for-link-prediction">Heuristic methods for link prediction</h4> Heuristic methods, e.g., Common Neighbors, utilize the graph structure to estimate the likelihood that a link exists.
We will briefly introduce some basic heuristic methods including Common Neighbors, Jaccard score, preferential attachment, and Katz index. Let $\Gamma(x)$ denote the set of neighbors of node $x$, and let $x$ and $y$ denote two different nodes. Common Neighbors (CN): The Common Neighbors algorithm assumes that two nodes with more overlapping neighbors are more likely to be connected. It calculates the size of the intersection between the neighbors of node $x$ and node $y$. $f_{\text{CN}}(x,y)=| \Gamma(x) \cap \Gamma(y) |\tag{16}$ Jaccard score: The Jaccard score can be viewed as a normalized Common Neighbors algorithm, where the normalization factor is the size of the union of the two neighbor sets. $f_{\text{Jaccard}}(x,y)=\frac{| \Gamma(x) \cap \Gamma(y) |}{| \Gamma(x) \cup \Gamma(y) |}\tag{17}$ Preferential attachment (PA): Preferential attachment assumes that nodes with higher degrees are more likely to be connected. It calculates the product of the node degrees. $f_{\text{PA}}(x,y)=| \Gamma(x) | \times | \Gamma(y) |\tag{18}$ Katz index: Unlike the above algorithms, which are based on the one-hop neighborhood, the Katz index takes higher-order neighbors into consideration. It assumes that nodes connected by more short walks are more likely to be linked, and calculates the weighted sum of all the walks between $x$ and $y$ as follows: $f_{\text{Katz}}(x,y)= \sum_{l=1}^{\infty}\beta^l |\text{walks}^{\left \langle l \right \rangle }(x,y)|\tag{19}$ where $\beta$ is a decaying factor between 0 and 1, which gives smaller weights to longer walks, and $|\text{walks}^{\left \langle l \right \rangle }(x,y)|$ counts the walks of length $l$ between $x$ and $y$. <h1 id="applying-graph-machine-learning-in-scientific-discovery">Applying Graph Machine Learning in Scientific Discovery</h1> In this section, we first introduce some general tips for applying graph machine learning in scientific discovery, followed by two success stories in molecular science and social science.
<h2 id="tips-for-applying-graph-machine-learning">Tips for Applying Graph Machine Learning</h2> <h4 id="efficiency-issues-on-graph">Efficiency issues on graphs</h4> <ul> <li> If your task focuses on a single large graph, you may run into out-of-memory issues. We suggest that you (1) use sampling strategies and (2) use fewer propagation layers so that not too many neighbors are involved. </li> <li> If your task focuses on multiple small graphs, time efficiency may be an issue (GNNs seem to be very slow on mini-batch tasks). </li> </ul> <h4 id="effective-issues-on-graph">Effectiveness issues on graphs</h4> <ul> <li> Features matter: if your graph nodes do not have features, you can construct them manually. Some suggested features are the node degree, Laplacian eigenvectors, and DeepWalk embeddings. </li> <li> Feature normalization may heavily influence the performance of GNN models. </li> <li> Adding self-loops may provide additional gain to your model. </li> <li> The performance on a single data split may not be reliable. Try different data splits for reliable performance. </li> </ul> <h4 id="when-graph-may-not-work">When graphs may not work</h4> <ul> <li> If your data does not naturally have a graph structure, it may not be necessary to construct one manually just to apply GNN methods. </li> <li> A GNN is a permutation-equivariant neural network. It may not work well on tasks that require other geometric properties, or on nodes that carry other kinds of information. </li> </ul> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Algorithm 1:
Input: molecule, radius R, fingerprint length S
Initialize: fingerprint vector f ← 0_S
for each atom a in molecule:
    r_a ← g(a)
for L = 1 to R:
    for each atom a in molecule:
        r_1 ... r_N = neighbors(a)
        v ← [r_a, r_1, ..., r_N]
        r_a ← hash(v)
        i ← mod(r_a, S)
        f_i ← 1
Return: binary vector f
</code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Algorithm 2:
Input: molecule, radius R, hidden weights H_1^1 ... H_R^5, output weights W_1 ... W_R
Initialize: fingerprint vector f ← 0_S
for each atom a in molecule:
    r_a ← g(a)
for L = 1 to R:
    for each atom a in molecule:
        r_1 ... r_N = neighbors(a)
        v ← r_a + Σ_{i=1}^{N} r_i
        r_a ← σ(vH_L^N)
        i ← softmax(r_a W_L)
        f ← f + i
Return: real-valued vector f
</code></pre></div></div> <center>Figure 7: Pseudocode of circular fingerprints (up) and neural graph fingerprints (down). Differences are highlighted in blue. Every non-differentiable operation is replaced with a differentiable analog. (algorithm taken directly from the paper [16].)</center> <h2 id="success-in-modeling-molecular-structures-chemistrybiology">Success in Modeling Molecular Structures (Chemistry/Biology)</h2> Molecules are one of the most common applications of graph neural networks, especially message passing neural networks. Molecules are naturally graph objects, and GNNs provide a compact way to learn representations on molecular graphs. This line of work was opened up by the seminal work on neural graph fingerprints [16], which built a neat connection between graph convolutions and the process of constructing the most commonly used structure representation (molecular fingerprints), as shown in Algorithm 2. It is worth noting that the commonly used string encoding for molecules (SMILES, the Simplified Molecular Input Line Entry System) can be considered as a parsing tree (an implicit graph representation) defined by the grammar. <center>Figure 8: Representing crystal structures with GNNs (handling periodicity). (Figure taken from [17])</center> <center>Figure 9: Designing protein sequences via a graph-to-sequence model. (Figure taken from [18])</center> There are mainly two branches of problems that have been explored extensively with graph representations and graph neural networks: (1) predictive tasks and (2) generative tasks. Predictive tasks refer to answering a specific question about given molecules, such as their toxicity or energy.
This is particularly beneficial for tasks like virtual screening, which would otherwise require experiments to obtain the properties of molecules. On the other hand, generative tasks aim to design and discover new molecules with certain properties of interest, which is also called molecular inverse design. For predictive tasks, graph representations provide an efficient and effective way to encode the graph structure of molecules and lead to better performance in downstream tasks of interest. For generative tasks, graph representations enable us to design the generative process in a more flexible way, as the graph representation can be mapped to molecules deterministically. Another research hot spot in modeling molecules with graphs is molecular pre-training, which arises from real-world applications. As the chemical space is gigantic (estimated to be $10^{23}$ to $10^{60}$ for small drug-like molecules), the areas we have explored are very limited. However, we have much more access to molecular structures without property annotations. This motivates research into leveraging unlabeled molecular structures to learn general and transferable representations that can be fine-tuned on any task, even with a small amount of labeled data. <center>Figure 10: Graph generation with diffusion models. (Figure taken from [19])</center> Last but not least, the work discussed above is mostly about small drug-like molecules. However, graph representations are applied much more widely to a variety of molecules, such as proteins and RNAs (large bio-molecules) and crystal structures or materials (with periodicity). Also, we mainly focus on 2D graph representations in this blog and defer the discussion of 3D graph representations to a later blog.
<h2 id="success-in-modeling-social-networks-social-sciences">Success in Modeling Social Networks (Social Sciences)</h2> Graphs are naturally well-suited as a mathematical formalism for describing and understanding social systems, which usually involve a number of people and their interpersonal relationships or interactions. The most well-known practice in this regard is the concept of social networks, where each person is represented by a vertex (node), and the interaction or relationship between two persons, if any, is represented by an edge (link). The practice of using graphs to study social systems dates back to the 1930s, when Jacob Moreno, a pioneering psychiatrist and educator, proposed the graphic representation of a person’s social links, known as the sociogram [20]. The approach was mathematically formulated in the 1950s and became common in social science in the 1980s. Zachary’s karate club. To motivate the study of social networks, it is worth introducing Zachary’s karate club [21] as a starting example. Zachary’s karate club refers to a university karate club studied by Wayne Zachary in the 1970s. The club had 34 members. If two members interacted outside the club, Zachary created an edge between their corresponding nodes in the social network representation. Figure 11 shows the resulting social network. What makes this social network interesting is that during Zachary’s study, a conflict arose between two senior members (node 1 and node 34 in the figure) of the club. All other members had to choose sides between the two senior members, essentially leading to a split of the club into two subgroups (i.e. “communities”). As the figure shows, there are two communities of nodes centered at node 1 and node 34, respectively. Zachary further analyzed this network and found that the exact split of club members can be identified purely based on the structure of the social network.
Briefly speaking, Zachary ran a min-cut algorithm on the collected social network. The min-cut algorithm essentially returns a group of edges that form the “bottleneck” of the whole social network. The nodes on different sides of the “bottleneck” are assigned to different splits. It turned out that Zachary was able to precisely identify the community membership of all nodes except node 9 (which indeed lies right on the boundary, as the figure shows). This example has often been used to illustrate that social networks (graphs) are a powerful formalism for revealing the underlying organizational truths of social systems. <center>Figure 11: The Zachary's karate club as a motivating example for studying social networks.</center> Important domains of study. The research of social networks has grown rapidly in the past few decades and has spawned many branches. Covering all those branches exhaustively would certainly go beyond the scope and capacity of this blog. Here we briefly survey a few of the most influential ones. <ul> <li> Static structure. The first step towards understanding social networks is to analyze their static structural properties. The effort involves the development of scientific measures to quantify those properties, and the empirical measurement of them on real-world social networks. Generally speaking, a social network can be analyzed at the local and global levels. At the local level, node centrality measures the “importance” of a person with respect to the whole network. Popular examples include degree centrality, betweenness centrality [22], closeness centrality [23], eigenvector centrality [24], PageRank centrality [25], etc. These measures differ in the aspects of social importance they emphasize.
For example, the eigenvector centrality $x_i$ of a person (node) $i$ is defined in a recursive manner as: $$\begin{aligned} x_i = \frac{1}{\lambda} \sum_{j\in\mathcal{N}(i)} x_j \end{aligned}$$ where $\lambda$ is the largest eigenvalue of the adjacency matrix of the social network (and is guaranteed to be a real, positive number). This centrality measure is underpinned by the principle that a person’s role is considered more important if that person has connections with more important people. We refer interested readers to [24] for more details. Besides node centrality, another example of a local measure is the clustering coefficient [26], which measures the tendency of “triadic closure” around a center node: $$\begin{aligned} c_i = \frac{|\{e_{jk} \in E: j,k\in \mathcal{N}(i)\}|}{k_i(k_i-1)/2} \end{aligned}$$ where $k_i$ is the degree of node $i$. At the global level, network distances and modularity are two measures for characterizing the macro structure of a social network. Popular network distance measures include shortest-path distances, random-walk-based distances, and (physics-inspired) resistance distance. Conceptually, they may be viewed as quantifiers of the “difficulty” of travelling along the edges of the social network from one node to another. Modularity often accompanies the important task of community detection for social networks. It measures the strength of division of a social network into groups or clusters of well-connected people. </li> <li> Dynamic structure. Real-world social interactions often involve time-evolving processes. Therefore, many studies on social networks explicitly incorporate temporal information into the modeling. The task of link prediction, for example, has often been introduced in attempts to model the evolution of a social network. The task predicts whether a link will appear between two people at some (given) future time, and thereby predicts the evolution of the social network.
Another area where dynamic structures of social networks are often discussed is the modeling of face-to-face social interactions. Some of the most recent works in this regard abstract people’s interaction traits, such as eye movement, eye gazing, and “speaking to” or “listening to” relationships, into attribute-rich dynamic links. It is believed that these dynamic interactions carry crucial information about the social event and people’s personalities. Therefore, using a temporal graph that explicitly models these interactions would greatly help the analysis of social interactions of this kind. For example, in [27], researchers found that using a temporal graph to build prediction models helps machines achieve state-of-the-art accuracy in identifying lying, dominance, and nervousness of people when they interact with each other in a role-playing game. </li> <li> Information flow. Sometimes the structure of social networks is not the ultimate target of interest to researchers. Instead, people care about the fact that their opinions and decision-making processes are often affected by their social interactions with friends and acquaintances. Therefore, social networks are often regarded as the infrastructure on which information flows and opinions propagate. It is thus crucial to know how social networks of different structures can affect the spreading of information. A long line of work, for example, has focused on modeling the so-called opinion dynamics on social networks. Research in this area has seen successful applications in viral marketing [28], international negotiations [29], and resource allocation [30]. There are many opinion dynamics models, all of which are essentially mathematical models that describe how people’s opinion(s) on some matters, represented as numerical value(s), dynamically affect each other following some mathematical rules that rely on the network structure.
Some of the most popular opinion dynamics models include the voter model [31], the Sznajd model [32], the Ising model [33], the Hegselmann-Krause (HK) model [34], and the Friedkin-Johnsen (FJ) model [35]. Here we introduce the Friedkin-Johnsen model as an example. The FJ model is not only a popular subject of study among social scientists in recent years, but is also, to date, the only model on which a sustained line of human-subject experiments has confirmed the model’s predictions of opinion changes. The FJ model assumes two opinions held by each person $i$ in the social network: an internal opinion $s_i$ that is always fixed, and an external opinion $z_i$ that evolves in adaptation to $i$’s internal opinion and its neighbors’ external opinions. The evolution of the external opinion $z_i$ along time steps follows the rule: $\begin{aligned} z^{0}_i &= s_i\\ z^{t+1}_i &= \frac{s_i+\sum_{j\in N_i}a_{ij}z^t_j}{1+\sum_{j\in N_i}a_{ij}} \end{aligned}$ where $N_i$ is the set of neighbors of node $i$ and $a_{ij}$ is the interaction strength between persons $i$ and $j$. One very elegant property of the FJ model is that the expressed (external) opinions will eventually reach a closed-form equilibrium: $\begin{aligned} z^{\infty} = (I+L)^{-1}s \end{aligned}$ where $z^{\infty}, s\in \mathbb{R}^{|V|}$ are the opinion vectors and $L$ is the Laplacian of the social network weighted by $a_{ij}$. This closed-form equilibrium brings tremendous convenience for the many follow-up works [36,37,38,39] to further define indices of, for example, polarization, disagreement, and conflict on the equilibrium opinions.
</li> </ul> <h1 id="learning-resources">Learning Resources</h1> <ul> <li> <a href="https://graph-neural-networks.github.io/index.html">Graph Neural Networks Foundations, Frontiers, and Applications</a> </li> <li> <a href="https://web.njit.edu/~ym329/dlg_book/index.html">Deep Learning on Graphs</a> </li> <li> <a href="http://web.stanford.edu/class/cs224w/">CS224W: Machine Learning with Graphs</a> </li> <li> <a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8347162">Graph Signal Processing: Overview, Challenges, and Applications</a> </li> <li> <a href="http://cse.msu.edu/~mayao4/tutorials/aaai2021/">Graph Neural Networks: Models and Applications</a> </li> <li> <a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9046288">A Comprehensive Survey on Graph Neural Networks</a> </li> </ul> <h2 id="references">References</h2> [1] Linton Freeman. The development of social network analysis. A Study in the Sociology of Science, 1(687):159–167, 2004. [2] Michael GH Bell, Yasunori Iida, et al. Transportation network analysis. 1997. [3] Jon Kleinberg and Steve Lawrence. The structure of the web. Science, 294(5548):1849–1850, 2001. [4] Ed Bullmore and Olaf Sporns. The economy of brain network organization. Nature reviews neuroscience, 13(5):336–349, 2012. [5] Kristel Van Steen. Travelling the world of gene–gene interactions. Briefings in bioinformatics, 13(1):1–19, 2012. [6] Minoru Kanehisa, Susumu Goto, Miho Furumichi, Mao Tanabe, and Mika Hirakawa. Kegg for representation and analysis of molecular networks involving diseases and drugs. Nucleic acids research, 38(suppl_1):D355–D360, 2010. [7] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017. [8] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017. [9] Thomas N Kipf and Max Welling. 
Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016. [10] Yao Ma and Jiliang Tang. Deep learning on graphs. Cambridge University Press, 2021. [11] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations. [12] Yao Ma, Xiaorui Liu, Tong Zhao, Yozen Liu, Jiliang Tang, and Neil Shah. A unified view on graph neural networks as graph signal denoising. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 1202–1211, 2021. [13] Francesco Di Giovanni, James Rowbottom, Benjamin P Chamberlain, Thomas Markovich, and Michael M Bronstein. Graph neural networks as gradient flows. arXiv preprint arXiv:2206.10991, 2022. [14] Keyulu Xu, Jingling Li, Mozhi Zhang, Simon S Du, Ken-ichi Kawarabayashi, and Stefanie Jegelka. What can neural networks reason about? In International Conference on Learning Representations. [15] Pinar Yanardag and SVN Vishwanathan. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pages 1365–1374, 2015. [16] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. Advances in neural information processing systems, 28, 2015. [17] Tian Xie and Jeffrey C Grossman. Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Physical review letters, 120(14):145301, 2018. [18] Justas Dauparas, Ivan Anishchenko, Nathaniel Bennett, Hua Bai, Robert J Ragotte, Lukas F Milles, Basile IM Wicky, Alexis Courbet, Rob J de Haas, Neville Bethel, et al. Robust deep learning–based protein sequence design using proteinmpnn. Science, 378(6615):49–56, 2022. 
[19] Clement Vignac, Igor Krawczuk, Antoine Siraudin, Bohan Wang, Volkan Cevher, and Pascal Frossard. Digress: Discrete denoising diffusion for graph generation. arXiv preprint arXiv:2209.14734, 2022. [20] Jacob Levy Moreno. Who shall survive?: A new approach to the problem of human interrelations. 1934. [21] Wayne W Zachary. An information flow model for conflict and fission in small groups. Journal of anthropological research, 33(4):452–473, 1977. [22] Linton C Freeman. A set of measures of centrality based on betweenness. Sociometry, pages 35–41, 1977. [23] Alex Bavelas. Communication patterns in task-oriented groups. The journal of the acoustical society of America, 22(6):725–730, 1950. [24] Mark EJ Newman. The mathematics of networks. The new palgrave encyclopedia of economics, 2(2008):1–12, 2008. [25] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Computer networks and ISDN systems, 30(1-7):107–117, 1998. [26] Duncan J Watts and Steven H Strogatz. Collective dynamics of ‘small-world’networks. nature, 393(6684):440–442, 1998. [27] Yanbang Wang, Pan Li, Chongyang Bai, and Jure Leskovec. Tedic: Neural modeling of behavioral patterns in dynamic social interaction networks. In Proceedings of the Web Conference 2021, pages 693–705, 2021. [28] Wei Chen, Yajun Wang, and Siyu Yang. Efficient influence maximization in social networks. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 199–208, 2009. [29] Carmela Bernardo, Lingfei Wang, Francesco Vasca, Yiguang Hong, Guodong Shi, and Claudio Altafini. Achieving consensus in multilateral international negotiations: The case study of the 2015 paris agreement on climate change. Science Advances, 7(51):eabg8068, 2021. [30] Noah E Friedkin, Anton V Proskurnikov, Wenjun Mei, and Francesco Bullo. Mathematical structures in group decision-making on resource allocation distributions. Scientific reports, 9(1):1377, 2019. 
[31] Richard A Holley and Thomas M Liggett. Ergodic theorems for weakly interacting infinite systems and the voter model. The annals of probability, pages 643–663, 1975. [32] Katarzyna Sznajd-Weron and Jozef Sznajd. Opinion evolution in closed community. International Journal of Modern Physics C, 11(06):1157–1165, 2000. [33] Sergey N Dorogovtsev, Alexander V Goltsev, and José Fernando F Mendes. Ising model on networks with an arbitrary distribution of connections. Physical Review E, 66(1):016104, 2002. [34] Hegselmann Rainer and Ulrich Krause. Opinion dynamics and bounded confidence: models, analysis and simulation. 2002. [35] Noah E Friedkin and Eugene C Johnsen. Social influence and opinions. Journal of Mathematical Sociology, 15(3-4):193–206, 1990. [36] Cameron Musco, Christopher Musco, and Charalampos E Tsourakakis. Minimizing polarization and disagreement in social networks. In Proceedings of the 2018 world wide web conference, pages 369–378, 2018. [37] Christopher Musco, Indu Ramesh, Johan Ugander, and R Teal Witter. How to quantify polarization in models of opinion dynamics. arXiv preprint arXiv:2110.11981, 2021. [38] Xi Chen, Jefrey Lijffijt, and Tijl De Bie. Quantifying and minimizing risk of conflict in social networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1197–1205, 2018. [39] Shahrzad Haddadan, Cristina Menghini, Matteo Riondato, and Eli Upfal. Repbublik: Reducing polarized bubble radius with link insertions. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pages 139–147, 2021. </article> <article> <h1>Matrix Factorization</h1> 2022-06-03T00:00:00+00:00 <h1 id="matrix-factorization---a-graph-perspective">Matrix Factorization – a graph perspective</h1> Matrix factorization is an important technique for representation learning. Its main applications include:
<ul> <li>Dimension reduction, or filling in the missing entries (low-rank recovery)</li> <li>Data clustering: use the principal components $Q$ to find a low-rank representation for data compression, or to find the hidden clusters in the matrix factorization</li> </ul> For low-rank approximation with missing data, matrix factorization can be formulated as: $\min_{U,V} ||W \odot (M-UV)||$ where $W$ is the mask of the observed entries and $M$ is the input matrix. However, this optimization is ill-posed on its own and easily leads to overfitting. An additional assumption has to be made, and the most important one is low rank. It can be imposed as a hard constraint through the factorized form $UV$, or as a soft constraint through the nuclear norm. The nuclear-norm constraint, however, is computationally expensive. Why matrix factorization? The matrix can be represented by a low-dimensional linear structure, and this structure does not change much under different conditions, which makes it more robust. How to find this kind of subspace becomes the key problem. **Note: pay attention to what $L$ really is in each method, normalized or not.** <h2 id="basic-model">Basic model</h2> <h3 id="pca">PCA</h3> Let the input data feature matrix be $X=(x_1, \cdots , x_n)\in \mathbb{R}^{p\times n}$. PCA aims to find the optimal low-dimensional subspace with principal directions $U=(u_1, \cdots , u_k)\in \mathbb{R}^{p\times k}$ and projected data points $V=(v_1, \cdots , v_n)\in \mathbb{R}^{n\times k}$. It minimizes the reconstruction error with the following loss function: $\min _{U, V}\left\|X-U V^{T}\right\|_{F}^{2} \text { s.t. } V^{T} V=I$ (I previously mixed up the projection direction and the projected data point.) Note that the data is assumed to be already centered. <h3 id="lda">LDA</h3> <h3 id="spectral-cluster">Spectral Cluster</h3> <h2 id="graph-based-method">Graph based method</h2> In the above discussion, the input data $X$ is the only available vector data for learning a data representation.
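As a reference point for this feature-only setting, the PCA factorization $X \approx UV^T$ with $V^TV=I$ can be sketched with a truncated SVD. This is a minimal sketch under stated assumptions, not code from any of the papers discussed: the function name `pca_factorize` and the random toy data are hypothetical and only serve as illustration.

```python
import numpy as np

def pca_factorize(X, k):
    """Rank-k PCA factorization X ~ U V^T with V^T V = I.

    X is p x n (features x samples) and is assumed to be centered,
    matching the notation of the PCA section. Returns U (p x k), V (n x k).
    """
    # Thin SVD: X = A diag(s) Bt, singular values in descending order.
    A, s, Bt = np.linalg.svd(X, full_matrices=False)
    V = Bt[:k].T            # n x k, orthonormal columns
    U = A[:, :k] * s[:k]    # p x k, principal directions scaled by singular values
    return U, V

# Hypothetical toy data: 5 features, 40 samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 40))
X = X - X.mean(axis=1, keepdims=True)   # center each feature over the samples
U, V = pca_factorize(X, k=2)
residual = np.linalg.norm(X - U @ V.T) ** 2
```

With this choice, $U = XV$ holds, which is the same relation that appears in the closed-form solution $U^{*}=XQ^{*}$ of the graph-regularized variant.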
For manifold learning and graph embedding methods, in contrast, only the graph adjacency matrix $W$ is taken into consideration. However, there is a lack of methods that take both the graph structure $W$ and the node features $X$ into consideration. We first answer the question of how these two families of methods can benefit from each other. In most cases, feature-based matrix factorization on $X$ can only capture linear structure, while manifold learning captures local information (locally linear -> globally nonlinear relationships) and can therefore discover that the data lies on a nonlinear manifold. (In some cases there is no given $W$, and the graph is constructed from feature similarity.) <h3 id="graph-laplacian-pca-closed-form-solution-and-robustness">Graph-Laplacian PCA: Closed-form Solution and Robustness</h3> To combine PCA and Laplacian embedding in one framework, the objective is: $\begin{array}{l} \min _{U, Q} J=\left\|X-U Q^{T}\right\|_{F}^{2}+\alpha \operatorname{Tr}\left(Q^{T}(D-W) Q\right) \\ \text { s.t. } Q^{T} Q=I \end{array}$ The solution is: $\begin{array}{l} Q^{*}=\left(v_{1}, v_{2}, \cdots, v_{k}\right) \\ U^{*}=X Q^{*} \end{array}$ where $\left(v_{1}, v_{2}, \cdots, v_{k}\right)$ are the eigenvectors corresponding to the $k$ smallest eigenvalues of $G = -X^TX + \alpha L$. What I do not fully understand here is the matrix differentiation step. The derivation first fixes $Q$ and solves for $U$, which gives $U = XQ$ (using $Q^TQ=I$); substituting this back leaves $\min_Q \operatorname{Tr}\left(Q^T(-X^TX+\alpha L)Q\right)$, whose solution is given by the eigenvectors above. Finally, in order to balance the two terms easily, i.e., to keep them on the same scale, the two terms are normalized by $\lambda_n$, the largest eigenvalue of $X^TX$, and by $\epsilon_n$, the largest eigenvalue of the Laplacian matrix $L$, respectively. **A question: how to connect this with graph neural networks? I think duality may be a solution.** However, the formulation above is not robust to outliers, so a robust version with a weaker norm is given: $\begin{array}{l} \min _{U, Q} J=\left\|X-U Q^{T}\right\|_{2,1}+\alpha \operatorname{Tr}\left(Q^{T}(D-W) Q\right) \\ \text { s.t. } Q^{T} Q=I \end{array}$ To put it into a form suitable for the Augmented Lagrange Multiplier (ALM) method, it can be rewritten as: $\begin{array}{l} \min _{U, Q, E}\|E\|_{2,1}+\alpha \operatorname{Tr} Q^{T}(D-W) Q \\ \text { s.t. } E=X-U Q^{T}, Q^{T} Q=I \end{array}$ (I do not know much about proximal optimizers and the Augmented Lagrange Multiplier method; I also need more background on gradients of norms.) Then, adding the two augmented terms for the constraint $E-X+U Q^{T}=0$, it can be written as: $\begin{array}{l} \min _{U, Q, E}\|E\|_{2,1}+\operatorname{Tr} C^{T}\left(E-X+U Q^{T}\right) \\ \quad+\frac{\mu}{2}\left\|E-X+U Q^{T}\right\|_{F}^{2}+\alpha \operatorname{Tr} Q^{T} L Q \\ \text { s.t. } Q^{T} Q=I \end{array}$ With $E$ fixed, the problem has the same form as the original one. With $U$ and $Q$ fixed, the problem becomes: $\min _{E}\|E\|_{2,1}+\frac{\mu}{2}\|E-A\|_{F}^{2}$ where $A = X-UQ^T-C/\mu$. The $\ell_{2,1}$ norm treats each column of $E$ as a group, so the problem decouples into $n$ independent column-wise problems: $\min _{e_{i}}\left\|e_{i}\right\|+\frac{\mu}{2}\left\|e_{i}-a_{i}\right\|^{2}$ Then the multiplier and the penalty parameter are updated as $\begin{array}{l} C=C+\mu\left(E-X+U Q^{T}\right) \\ \mu=\rho \mu \end{array}$ A major open question here is whether the projected data should be as orthogonal as possible, or the chosen coordinate basis should be as orthogonal as possible, and whether it matters if $L$ is normalized. <h3 id="connecting-graph-convolutional-networks-and-graph-regularized-pca">Connecting Graph Convolutional Networks and Graph-Regularized PCA</h3> Similarly, there is another work on a PCA-based network; however, the constraint on the representation is not the same as in the original one. The objective is: $\begin{array}{l} \min _{Z, W} J=\left\|X-Z W^{T}\right\|_{F}^{2}+\alpha \operatorname{Tr}\left(Z^{T}\tilde{L} Z\right) \\ \text { s.t. } W^{T} W=I \end{array}$ Note that the smoothness term is different: it smooths the representation $Z$, not the coordinates.
The solution steps are similar to those above, and the solution is: $\begin{aligned} W^{*} &=\left(\mathbf{w}_{1}, \mathbf{w}_{2}, \ldots, \mathbf{w}_{k}\right) \\ Z^{*} &=(I+\alpha \tilde{L})^{-1} X W^{*} \end{aligned}$ where $\mathbf{w}_{1}, \mathbf{w}_{2}, \ldots, \mathbf{w}_{k}$ are the eigenvectors corresponding to the $k$ largest eigenvalues of the matrix $X^T(I+\alpha \tilde{L})^{-1}X$. So the open questions are: <ul> <li>What is the difference between finding a graph-based projection and finding graph-based (smoothed) data, and how does this relate to graph signal processing?</li> <li>How should feature centering be handled in a real practical scenario?</li> <li>What is the connection between LDA and the weights of a neural network?</li> <li>What is the correlation with our form?</li> <li>How to transform this kind of dual form?</li> </ul> It seems the current analysis only covers the case where the middle matrix is positive semi-definite. <h3 id="low-rank-matrix-approximation-with-manifold-regularization">Low-Rank Matrix Approximation with manifold regularization</h3> Is there any essential difference between factorizing the matrix and directly optimizing the corresponding objective? It seems there is not. The optimization takes the form: $\begin{array}{l} \min _{U, Y} J=\left\|A-UY\right\|_{F}^{2}+\alpha \operatorname{Tr}\left(Y\tilde{L} Y^T\right) \\ \text { s.t. } U^{T} U=I \end{array}$ The solution is an optimal pair $(U, Y)$; it is not unique, since $(UQ, Q^TY)$ gives the same product for any orthogonal $Q$. The formulation can also be written without the factorized form: $\min_{\operatorname{rank}(X)\le r}\left\|A-X\right\|_{F}^{2}+\alpha \operatorname{Tr}\left(X\Phi X^T\right)$ where $X=UY$; this does not change the final result. As $\Phi$ can be the Laplacian matrix, which is positive semi-definite in most cases, we can take the symmetric square root $B$ with $\alpha\Phi+I = B^TB$, where the $I$ comes from the first term. Then $f(X)=\|A\|_F^2-2\operatorname{tr}(XBB^{-1}A^T)+\|XB\|_F^2=\|A\|_F^2+\|XB-AB^{-1}\|_F^2-\|AB^{-1}\|_F^2$ The remaining part of the method, the alternating iteration algorithm, is similar to the one above.
The algorithm obtains the solution with the SVD, which is not the key point in our analysis. <h2 id="robust-pca-algorithm-with-graph-constraint">Robust PCA algorithm with graph constraint</h2> Different from the above methods, robust methods are less sensitive to outliers, while the original PCA, which uses the L2 norm, can be strongly affected by them since it is based on a Gaussian noise assumption. The L1 norm is more robust, but the resulting factorization problem is non-smooth and non-convex, and hard to optimize. The methods below try to address this situation. <h3 id="practical-low-rank-matrix-approximation-under-robust-l1-norm">Practical Low-Rank Matrix Approximation under Robust L1-Norm</h3> This paper mainly adds a convex trace-norm regularizer to avoid over-smoothing, and uses the Augmented Lagrange Multiplier (ALM) method as the optimizer. The problem, still within the matrix factorization framework, is: $\min_{U,V} ||W \odot (M-UV)||_1 \text{ s.t. } U^TU = I$ The constraint removes the ambiguity of having many equivalent $(U, V)$ pairs in the final result. To obtain a smooth low-rank solution, a small nuclear norm is added as a regularization term: $\min_{U,V} ||W \odot (M-UV)||_1+\lambda ||V||_* \text{ s.t. } U^TU = I$ Then the problem is passed to an ALM solver, introducing the auxiliary variable $E=UV$. $\begin{aligned} f(E, U, V, L, \mu)=&\|W \odot(M-E)\|_{1}+\lambda\|V\|_{*}+\\ &\langle L, E-U V\rangle+\frac{\mu}{2}\|E-U V\|_{F}^{2}, \end{aligned}$ Then $U$ is solved via Orthogonal Procrustes, $V$ via Singular Value Shrinkage, and $E$ via Absolute Value Shrinkage. <h3 id="robust-principal-component-analysis-on-graphs">Robust Principal Component Analysis on Graphs</h3> Note first that the method below does not use an additional given graph structure for learning; the graph is built from feature similarity. The graph can improve the clustering property thanks to the graph smoothness assumption on the low-rank matrix.
Different from the above work, this paper abandons the explicit matrix factorization in favor of an additive decomposition. The proposed model is as follows: $\begin{array}{l} \min _{L, S}\|L\|_{*}+\lambda\|S\|_{1}+\gamma \operatorname{tr}\left(L \Phi L^{T}\right) \text {, } \\ \text { s.t. } X=L+S, \end{array}$ where $S$ is the sparse error and $L$ is the low-rank approximation of $X$. The final term enforces smoothness on the graph structure. The problem is then rewritten with an auxiliary variable $W$ constrained to equal $L$, so that the nuclear norm and the graph term are separated for the ALM solver. $\begin{array}{l} \min _{L, S}\|L\|_{*}+\lambda\|S\|_{1}+\gamma \operatorname{tr}\left(W \Phi W^{T}\right) \text {, } \\ \text { s.t. } X=L+S, L=W \end{array}$ Then, for each constraint, a Lagrange multiplier is introduced, $Z_1\in \mathbb{R}^{p\times n}$ and $Z_2\in \mathbb{R}^{p\times n}$, and the problem is transformed into: $\begin{aligned} (L, S, W)^{k+1} &=\underset{L, S, W}{\operatorname{argmin}}\|L\|_{*}+\lambda\|S\|_{1}+\gamma \operatorname{tr}\left(W \Phi W^{T}\right) \\ &+\left\langle Z_{1}^{k}, X-L-S\right\rangle+\frac{r_{1}}{2}\|X-L-S\|_{F}^{2} \\ &+\left\langle Z_{2}^{k}, W-L\right\rangle+\frac{r_{2}}{2}\|W-L\|_{F}^{2}, \\ Z_{1}^{k+1} &=Z_{1}^{k}+r_{1}\left(X-L^{k+1}-S^{k+1}\right), \\ Z_{2}^{k+1} &=Z_{2}^{k}+r_{2}\left(W^{k+1}-L^{k+1}\right), \end{aligned}$ Then the problem can be solved by: $\begin{array}{l} L^{k+1}=\operatorname{prox}_{\frac{1}{\left(r_{1}+r_{2}\right)}}\|L\|_{*}\left(\frac{r_{1} H_{1}^{k}+r_{2} H_{2}^{k}}{r_{1}+r_{2}}\right), \\ S^{k+1}=\operatorname{prox}_{\frac{\lambda}{r_{1}}}\|S\|_{1}\left(X-L^{k+1}+\frac{Z_{1}^{k}}{r_{1}}\right) \\ W^{k+1}=r_{2}\left(\gamma \Phi+r_{2} I\right)^{-1}\left(L^{k+1}-\frac{Z_{2}^{k}}{r_{2}}\right) \end{array}$ where, completing the squares in $L$ in the augmented Lagrangian above, $H_{1}^{k}=X-S^{k}+Z_{1}^{k}/r_{1}$ and $H_{2}^{k}=W^{k}+Z_{2}^{k}/r_{2}$. Assuming that a p-nearest neighbors graph is available, there are several ways to weight the edges of the neighborhoods: <ul> <li>binary</li> <li>heat kernel</li> <li>correlation distance</li> </ul> <h3 id="fast-robust-pca-on-graphs">Fast Robust PCA on Graphs</h3> Similar to the above paper, this paper gives more detail on how a graph built from feature similarity can enhance performance. The method introduces graph smoothness on both the samples and the features, and it can also show clear clusters under some theoretical conditions. The objective is as follows: $\begin{array}{l} \min _{U, S}\|S\|_{1}+\gamma_{1} \operatorname{tr}\left(U \mathcal{L}_{1} U^{\top}\right)+\gamma_{2} \operatorname{tr}\left(U^{\top} \mathcal{L}_{2} U\right), \\ \text { s.t. } X=U+S, \end{array}$ where $U$ is not explicitly constrained to be a low-dimensional representation. The optimization handles the two graph constraints with the Fast Iterative Soft Thresholding Algorithm. The graph is constructed with: $A_{i j}=\left\{\begin{array}{ll} \exp \left(-\frac{\left\|\left(x_{i}-x_{j}\right)\right\|_{2}^{2}}{\sigma^{2}}\right) & \text { if } x_{j} \text { is connected to } x_{i} \\ 0 & \text { otherwise. } \end{array}\right.$ The two graphs are based on sample similarity and feature similarity respectively. How can they give us more information? The graph of features provides a basis for the data, which is well aligned with the covariance matrix $C$. The graph of samples provides the embedding, which has a similar interpretation to PCA. In a word, the Laplacian matrix has some similarity with the PCA-based method. Therefore, the low-rank matrix should be representable as a linear combination of the feature and sample vectors.
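The heat-kernel graph construction above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `heat_kernel_knn_graph`, the symmetrization rule (keep an edge if either endpoint selects the other), and the toy data are hypothetical choices for illustration, not the papers' implementation.

```python
import numpy as np

def heat_kernel_knn_graph(X, n_neighbors, sigma):
    """Symmetric k-nearest-neighbor graph with heat-kernel weights.

    X is n x p (one item per row). Returns the weight matrix A (n x n)
    and the combinatorial Laplacian L = D - A.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances, clipped at 0 for numerical safety.
    sq = np.sum(X ** 2, axis=1)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    np.fill_diagonal(dist2, np.inf)            # exclude self from the neighbor search
    nearest = np.argsort(dist2, axis=1)[:, :n_neighbors]
    rows = np.repeat(np.arange(n), n_neighbors)
    cols = nearest.ravel()
    A = np.zeros((n, n))
    A[rows, cols] = np.exp(-dist2[rows, cols] / sigma ** 2)
    A = np.maximum(A, A.T)                     # assumed symmetrization rule
    L = np.diag(A.sum(axis=1)) - A
    return A, L

# Hypothetical toy data: 30 samples with 4 features each.
rng = np.random.default_rng(0)
data = rng.normal(size=(30, 4))
A_samples, L_samples = heat_kernel_knn_graph(data, n_neighbors=5, sigma=1.0)
A_features, L_features = heat_kernel_knn_graph(data.T, n_neighbors=2, sigma=5.0)
```

Calling the same function on the data matrix and on its transpose yields the sample graph and the feature graph, whose Laplacians play the roles of the two graph regularizers in the objective.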
The result is bounded by the gap between eigenvalues: $\begin{array}{c} \phi\left(U^{*}-X\right)+\gamma_{1}\left\|U^{*} \bar{Q}_{k_{1}}\right\|_{F}^{2}+\gamma_{2}\left\|\bar{P}_{k_{2}}^{\top} U^{*}\right\|_{F}^{2} \\ \leq \phi(E)+\gamma\left\|X^{*}\right\|_{F}^{2}\left(\frac{\lambda_{k_{1}}}{\lambda_{k_{1}+1}}+\frac{\omega_{k_{2}}}{\omega_{k_{2}+1}}\right) \end{array}$ <h3 id="deep-matrix-factorization-with-spectral-geometric-regularization">Deep Matrix Factorization with Spectral Geometric Regularization</h3> Deep matrix factorization is similar to a DNN: while the original matrix factorization always uses two factors, like $X=X_1X_2$, the deep version uses a product of several factors, $X=\prod_{i=1}^NX_i$. The product graph is given as the Cartesian product of $\mathcal{G}_1$ and $\mathcal{G}_2$, whose Laplacian matrix can be represented as $L_{\mathcal{G}_1 \Box \mathcal{G}_2 } = L_1 \otimes I + I \otimes L_2$ With $\Phi$ and $\Psi$ the eigenvectors of the two individual Laplacian matrices, the functional map is defined as $C=\Phi^TX\Psi$, which maps between the function spaces of $\mathcal{G}_1$ and $\mathcal{G}_2$. It can also be viewed as the signal on the product graph. The following property holds: $\alpha = \Phi^Tx=C\Psi^Ty=C\beta$ for $x=\Phi\alpha$ and $y=\Psi\beta$. The optimization objective is as follows: $\min_X E_{data}(X)+\mu E_{dir}(X) \text{ s.t. } rank(X) \lt r$ The Dirichlet energy is $E_{dir}(X)=tr(X^TL_rX)+tr(XL_cX^T)$. We then decompose $X$ as $X=AZB^T$, where $Z$ is the signal that lies on the latent product graph. These three factors can be further factorized as: $\begin{array}{l} \boldsymbol{Z}=\boldsymbol{\Phi}^{\prime} \boldsymbol{C} \boldsymbol{\Psi}^{\prime \top} \\ \boldsymbol{A}=\Phi \boldsymbol{P} \Phi^{\prime \top} \\ B=\boldsymbol{\Psi} Q \boldsymbol{\Psi}^{\prime \top} \end{array}$ The objective can be transformed into $\min_{P,C,Q} ||(\Phi PCQ^T\Psi^T-M)||_F^2+tr(QC^TP^T\Lambda_rPCQ^T)+tr(PCQ^T\Lambda_cQC^TP^T)$ <h3 id="matrix-decomposition-on-graphs-a-functional-view">Matrix Decomposition on Graphs: A Functional View</h3> Once we have two graphs, it is natural to think about the correlation between them, which is a function on the product graph. This work tries to give a unified view of geometric matrix completion and graph-regularized dimension reduction. We use the form $X=\Phi C\Psi^T$. The matrix factorization establishes basis consistency, since the low-dimensional representation of $X$ lies in the span of $\Psi$ and $\Phi$. It then requires correspondence between the eigenvalues: $E_{reg}=||C\Lambda_r-\Lambda_cC||^2$ where $\Lambda$ denotes the eigenvalues of the corresponding graph Laplacian. <h2 id="other-matrix-factorization-methods">Other Matrix Factorization Methods</h2> <h3 id="local-low-rank-matrix-approximation">Local Low-Rank Matrix Approximation</h3> <h2 id="understanding-on-graph-with-matrix-factorization">Understanding on graph with matrix factorization</h2> <h3 id="simplication-of-graph-convolutional-networks-a-matrix-factorization-based-perpective">Simplification of Graph Convolutional Networks: A Matrix Factorization-based Perspective</h3> The motivation of this paper is to connect matrix factorization based graph embedding methods with GNN. In this way, it does not need to load the whole graph at once, but can use sampling to get the embedding of each node.
However, this paper has a very important drawback: there is no discussion of the feature space. The only input is the graph structure. This paper aims to analyze the connection between GCN and MF, simplify GCN to MF only, and use unitization and co-training to learn a node classification model. The analysis is done on the last layer. The original GCN can be written as $\mathbf{H}^{(-1)}=\sigma\left(\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{H}^{(-1)} \mathbf{W}^{(-1)}\right)$ where $\mathbf{H}^{(-1)}$ is the hidden representation of the last layer. Written node-wise: $h_{i}^{(-1)}=\sum_{j \in I} \frac{1}{\sqrt{\left(d_{i}+1\right)\left(d_{j}+1\right)}} \mathbf{A}_{i, j} h_{j}^{(-1)}+\frac{1}{d_{i}+1} h_{i}^{(-1)}$ Notice a subtlety here: strictly speaking, the last-layer representation $h_{i}^{(-1)}$ on the left-hand side should not be the same as the one on the right-hand side. Treating them as the same and solving for $h_{i}^{(-1)}$, it can be rewritten as: $h_{i}^{(-1)}=\sum_{j \in I} \frac{1}{d_i}\sqrt{\frac{\left(d_{i}+1\right)}{\left(d_{j}+1\right)}}A_{i,j}h_j^{(-1)}$ Then the distance function that GCN tries to optimize becomes: $l_{s}=\sum_{i \in I} \text { Cosine }\left(h_{i}^{(-1)}, \sum_{j \in I} \frac{1}{d_{i}} \sqrt{\frac{d_{i}+1}{d_{j}+1}} \mathbf{A}_{i, j} h_{j}^{(-1)}\right)$ Then, if we choose the negative samples randomly, the optimal representation can be written as $VV^T = \log(|G|D^{-1}AD^{-1})-\log(k)$ I do not think this form is very meaningful, since it is the same as the one for LINE; what is the difference then? Another question: for node embedding methods, does it matter whether each node has only one embedding or two embeddings? The answer is that it does not matter much. Also, I think it is important to clarify the difference from, and the connection with, graph embedding methods.
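To make the node-wise form of GCN propagation used in this analysis concrete, the sketch below checks numerically that the matrix form $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H$ and the node-wise sum agree. It is an illustrative sketch only: weights and activation are omitted, and the function names and the random toy graph are assumptions, not the paper's code.

```python
import numpy as np

def gcn_propagate(A, H):
    """Matrix form D~^{-1/2} (A + I) D~^{-1/2} H, without weights or activation."""
    A_tilde = A + np.eye(A.shape[0])
    d_tilde = A_tilde.sum(axis=1)                 # d_i + 1
    inv_sqrt = 1.0 / np.sqrt(d_tilde)
    return (inv_sqrt[:, None] * A_tilde * inv_sqrt[None, :]) @ H

def gcn_propagate_nodewise(A, H):
    """Same propagation written as the node-wise sum from the text."""
    n = A.shape[0]
    d = A.sum(axis=1)
    out = np.zeros_like(H, dtype=float)
    for i in range(n):
        for j in range(n):
            if A[i, j] != 0:
                out[i] += A[i, j] * H[j] / np.sqrt((d[i] + 1) * (d[j] + 1))
        out[i] += H[i] / (d[i] + 1)               # self-loop term
    return out

# Hypothetical toy graph: random symmetric 0/1 adjacency without self-loops.
rng = np.random.default_rng(0)
upper = np.triu((rng.random((6, 6)) < 0.5).astype(float), k=1)
A = upper + upper.T
H = rng.normal(size=(6, 3))
matrix_form = gcn_propagate(A, H)
nodewise_form = gcn_propagate_nodewise(A, H)
```

The two results agree up to floating-point error, which is the step that lets the analysis reason about a single node $i$ and its neighbors.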
</article> <article> <h1>A review on deeper GNN</h1> 2022-02-18T00:00:00+00:00 <h1 id="a-review-on-problematic-deeper-gnn">A review on problematic deeper GNN</h1> Node classification is the most well-known task in the graph domain; it aims to distinguish the type of each node in a graph. In this field, people also study many fundamental limitations of GNN. The main challenge is that we believe GNN should be more powerful with more layers and more parameters. For example, it is easy to build a CNN with more than 100 layers. However, this usually does not work for GNN. To build deeper GNN with more parameters, people try to understand and explain this problem and propose solutions. We aim to answer the following research questions about deeper GNN: <ul> <li>How does GNN really help? Deep understanding on GNN</li> <li>Problems on GNN: overfitting? Gradient explosion? Gradient vanishing? Oversmoothing? Oversquashing?</li> <li>Why does GNN fail because of oversmoothing? From both empirical and theoretical perspectives, with some solutions.</li> </ul> A word up front: this review is somewhat more difficult than my earlier reviews on graph classification, heterophily graphs, and domain adaptation, since it involves many advanced topics on GNN. I will try my best to understand it and write about it. This will not be the last version of this blog; I aim to go beyond the above. <h2 id="how-does-gnn-really-help-deep-understanding-on-gnn">How does GNN really help? Deep understanding on GNN</h2> TODO <h2 id="problems-on-gnn-overfitting-gradient-explosion-gradient-vanishment-oversmooth-oversquash">Problems on GNN: overfitting? gradient explosion? gradient vanishment? oversmooth? oversquash?</h2> For the problematic deeper GNN, different papers have proposed different explanations. We first introduce them quickly. <ul> <li>Gradient explosion and vanishing are two common problems in DNN, which can make training fail.
See my other blog post <a href="https://huanhuqueyue.github.io/personal-page/categories/neuronCampaign/">[link]</a> for details. There is little new understanding specific to GNN from this perspective.</li> <li>Overfitting is also a very common problem in DNN. The main phenomenon is high training accuracy and low test accuracy, with a large accuracy gap. Since it is highly related to generalization, some new understanding of the transductive setting is provided.</li> <li>Oversmoothing is a new problem in GNN: node representations become indistinguishable when the number of layers increases. An intuitive understanding is that when too many neighborhood nodes are aggregated (up to the whole graph), each node representation is aggregated from the whole graph, which makes the representations indistinguishable.</li> <li>Oversquashing is a new problem borrowed from RNN. Information from the exponentially growing receptive field is compressed into fixed-length node vectors. As a result, the GNN only focuses on fitting the local neighbors, without properly using the exponentially many node features aggregated from further hops. From my perspective, it is in a sense the opposite of the oversmoothing problem, since it admits that aggregation differs across levels.</li> <li>(Model degradation is a rather confusing phrase; it is hard for me to understand.)</li> </ul> Among them, oversmoothing has been the main focus recently.
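The oversmoothing effect described in the list above can be observed numerically with a few lines of linear propagation. The sketch below is illustrative only: it ignores weights and activations, uses the row-normalized propagation $\tilde{D}^{-1}(A+I)$, and the toy graph and the helper names are assumptions.

```python
import numpy as np

def random_walk_propagation(A):
    """Row-normalized adjacency with self-loops: D~^{-1} (A + I)."""
    A_tilde = A + np.eye(A.shape[0])
    return A_tilde / A_tilde.sum(axis=1, keepdims=True)

def representation_spread(H):
    """Largest Euclidean distance between any two node representations."""
    diff = H[:, None, :] - H[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).max()

# Hypothetical connected toy graph: path 0-1-2-3-4 plus the chord 1-3.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 3)]
A = np.zeros((5, 5))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

P = random_walk_propagation(A)
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 3))           # random initial node features

spreads = []                          # (number of propagation steps, spread)
for layer in range(65):
    if layer in (0, 1, 2, 4, 8, 16, 32, 64):
        spreads.append((layer, representation_spread(H)))
    H = P @ H                         # one parameter-free "layer"
```

Each propagation step replaces a node's representation with a convex combination of its neighbors' representations, so the spread never increases, and on this connected graph it shrinks towards zero as the number of steps grows.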
Based on this, the following content is organized along these perspectives: <ul> <li>The fundamental theoretical understanding on oversmooth</li> <li>Spatial: understanding and solution inspired by PageRank</li> <li>Spectral: understanding and solution inspired by graph signal processing</li> <li>Understand GNN as a recursive boosting procedure</li> <li>Dynamical system: understanding like a continuous function with PDE solution</li> <li>Advanced operation <ul> <li>additional connection on model architecture</li> <li>normalization regularization trick</li> </ul> </li> <li>Advanced analysis and rethinking on oversmoothness</li> </ul> <h3 id="the-fundamental-theory-understanding-on-oversmooth">The fundamental theory understanding on oversmooth</h3> In this section, we mainly focus on the two most widely used theoretical explanations of the oversmoothing problem, with and without considering the non-linear activation function. <h4 id="deeper-insights-into-graph-convolutional-networks-for-semi-supervised-learning">Deeper Insights into Graph Convolutional Networks for Semi-Supervised Learning</h4> Suppose that a graph $\mathcal{G}$ has $k$ connected components <code class="language-plaintext highlighter-rouge">$\{C_i \}_{i=1}^k$</code> and the indicator vector for the <code class="language-plaintext highlighter-rouge">$i$</code>-th component is denoted by $\mathbf{1}^{(i)}\in \mathbb{R}^n$. This vector indicates whether a vertex is in the component $C_i$: $\mathbf{1}_{j}^{(i)}=\left\{\begin{array}{l} 1, v_{j} \in C_{i} \\ 0, v_{j} \notin C_{i} \end{array}\right.$ Theorem 1. If a graph has no bipartite components, then for any $\mathbf{w}\in \mathbb{R}^n$ and $\alpha \in (0,1]$ $\begin{array}{l} \lim _{m \rightarrow+\infty}\left(I-\alpha L_{r w}\right)^{m} \mathbf{w}=\left[\mathbf{1}^{(1)}, \mathbf{1}^{(2)}, \ldots, \mathbf{1}^{(k)}\right] \theta_{1} \\ \lim _{m \rightarrow+\infty}\left(I-\alpha L_{s y m}\right)^{m} \mathbf{w}=D^{\frac{1}{2}}\left[\mathbf{1}^{(1)}, \mathbf{1}^{(2)}, \ldots,
\mathbf{1}^{(k)}\right] \theta_{2}, \end{array}$ where $\theta_1 \in \mathbb{R}^k$, $\theta_2 \in \mathbb{R}^k$, i.e., they converge to a linear combination of $\{\mathbf{1}^{(i)}\}_{i=1}^k$ and $\{D^{\frac{1}{2}}\mathbf{1}^{(i)}\}_{i=1}^k$ respectively, which span the eigenspaces of eigenvalue 0 of $L_{rw}$ and $L_{sym}$. My understanding of the proof is as follows: <ul> <li>A power of a symmetric matrix can be written as $L^m = U \Lambda^m U^T$, i.e., it corresponds to taking the power of each eigenvalue.</li> <li>With no bipartite components, the eigenvalues of $\mathcal{L}$ fall in $[0,2)$ (look into the reason when time permits).</li> <li>Then the eigenvalues of $I-\alpha\mathcal{L}$ fall in $(1-2\alpha, 1]$.</li> <li>With $\alpha \in (0,1]$ (a balance between the current node and the neighbor nodes), all eigenvalues have absolute value strictly less than 1, except for eigenvalue 1, whose eigenvectors are the component indicators (eigenvalue 0 of $\mathcal{L}$).</li> <li>Therefore only the eigenvalue-1 part survives as $m \to \infty$, and the result follows.</li> </ul> We see that, without considering the transformation part, information is lost until only the degree and the component information remain. Let $\lambda_2$ denote the second largest eigenvalue of the transition matrix $\tilde{T} = D^{−1}A$ of a non-bipartite graph, $p(t)$ the probability distribution vector, and $\pi$ the stationary distribution. If the walk starts from vertex $i$, i.e., $p_i(0) = 1$, then after $t$ steps, for every vertex $j$, we have: $\left|p_{j}(t)-\pi_{j}\right| \leq \sqrt{\frac{d_{j}}{d_{i}}} \lambda_{2}^{t}$ TODO: check this theory, read GDC (which paper was the proof above from? I forgot). <h4 id="graph-neural-networks-exponentially-lose-expressive-power-for-node-classification">GRAPH NEURAL NETWORKS EXPONENTIALLY LOSE EXPRESSIVE POWER FOR NODE CLASSIFICATION</h4> This paper considers the expressivity of GNN, a fundamental topic in deep learning; as we all know, a two-layer MLP is already a universal approximator that can express arbitrary non-linear functions.
Taking the non-linear transformation into account, this paper finds that as the number of layers goes to infinity, the output approaches, exponentially fast, the set of signals that only carry information about the connected components and node degrees (a subspace that is invariant under the dynamics), which is the same conclusion as the previous one. The key assumption is that the weights of the non-linear transformation satisfy conditions determined by the spectrum of the (augmented) normalized Laplacian. The speed of approaching the invariant space is $O((s\lambda)^L)$, where $s$ is the largest singular value of the weight matrix $W$ and $\lambda$ corresponds to the eigenvalues of the Laplacian matrix. <h5 id="key-notation">Key notation</h5> For a linear operator <code class="language-plaintext highlighter-rouge">$P:\mathbb{R}^N\to \mathbb{R}^M$</code> and a subset $V \subset \mathbb{R}^N$, we denote the restriction of $P$ to $V$ by $P|_V$. Let $P \in \mathbb{R}^{N \times N}$ be the symmetric adjacency matrix. For $M \le N$, let $U$ be an $M$-dimensional subspace of $\mathbb{R}^N$. We make the following assumptions on $U$, which plays the role of the eigenvector subspace for the GNN: <ul> <li>$U$ has an orthonormal basis $(e_m)_{m\in[M]}$ that consists of non-negative vectors.</li> <li>$U$ is invariant under $P$, i.e., if $u \in U$, then $Pu \in U$. The orthogonal complement of $U$, called $U^\perp$, has the same property.</li> </ul> Then the restricted linear mapping can be written as $P|_{U^\perp}: U^\perp \to U^\perp$. The operator norm $\lambda$ of $P|_{U^\perp}$ is equal to $\lambda = \sup_\mu|g(\mu)|$ where $g$ is the polynomial that defines $P$. The subspace $\mathcal{M}\subseteq \mathbb{R}^{N\times C}$ is built from the basis and arbitrary feature vectors as: $\mathcal{M}:=U \otimes \mathbb{R}^{C}=\left\{\sum_{m=1}^{M} e_{m} \otimes w_{m} \mid w_{m} \in \mathbb{R}^{C}\right\}$ The distance between a representation $X$ and the subspace is: $d_{\mathcal{M}}(X):=\inf \left\{\|X-Y\|_{\mathrm{F}} \mid Y \in \mathcal{M}\right\}$ which is the Frobenius distance from $X$ to the closest point of the subspace.
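Since $\mathcal{M}$ is a linear subspace, the infimum in $d_{\mathcal{M}}(X)$ is attained by the orthogonal projection, which gives a direct way to compute it. The sketch below assumes the orthonormal basis of $U$ is stored as the columns of a matrix `E`; the function name and the toy basis are hypothetical and only serve as illustration.

```python
import numpy as np

def distance_to_subspace(X, E):
    """d_M(X) for M = U (x) R^C, where the columns of E are an orthonormal basis of U."""
    projection = E @ (E.T @ X)                # closest point of M to X
    return np.linalg.norm(X - projection)     # Frobenius norm for a 2-D array

# Assumed toy setup: U is spanned by a single non-negative unit vector.
rng = np.random.default_rng(0)
N, C = 6, 3
e = np.sqrt(np.arange(1, N + 1, dtype=float))
E = (e / np.linalg.norm(e))[:, None]          # N x 1 orthonormal basis of U
X = rng.normal(size=(N, C))
d = distance_to_subspace(X, E)
```

Any $Y = EW$ with $W \in \mathbb{R}^{M \times C}$ lies in $\mathcal{M}$ and has distance zero, and $d_{\mathcal{M}}(X)^2 + \|EE^TX\|_F^2 = \|X\|_F^2$ by the Pythagorean identity.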
The maximum singular value of the weight matrix $W_{lh}$ of the non-linear transform is denoted by $s_{lh}$, and $s_l = \prod_{h=1}^Hs_{lh}$. One question: why look at this problem on a subspace? <h5 id="theorem">Theorem</h5> \[d_{\mathcal{M}}\left(f_{l}(X)\right) \leq s_{l} \lambda d_{\mathcal{M}}(X)\] for any $X\in \mathbb{R}^{N\times C}$, given that the non-linear operation $\sigma$ does not increase the distance $d_{\mathcal{M}}$. This theorem can be proved from three basic lemmas. Lemma 1 $d_{\mathcal{M}}\left(PX\right) \leq \lambda d_{\mathcal{M}}(X)$ Extend the basis $(e_m)_{m\in[M]}$ of $U$ to an orthonormal basis $(e_m)_{m\in[N]}$ of $\mathbb{R}^N$ consisting of eigenvectors of $P$. Any $X \in \mathbb{R}^{N\times C}$ can then be written as <code class="language-plaintext highlighter-rouge">$X=\sum_{m=1}^{N} e_{m} \otimes w_{m}$</code>, and the distance to the subspace is <code class="language-plaintext highlighter-rouge">$d^2_{\mathcal{M}}(X)=\sum^N_{m=M+1}||w_m||^2$</code>. Then $PX$ can be written as $\begin{aligned} P X &=\sum_{m=1}^{N} P e_{m} \otimes w_{m} \\ &=\sum_{m=1}^{M} P e_{m} \otimes w_{m}+\sum_{m=M+1}^{N} P e_{m} \otimes w_{m} \\ &=\sum_{m=1}^{M} P e_{m} \otimes w_{m}+\sum_{m=M+1}^{N} e_{m} \otimes\left(\lambda_{m} w_{m}\right) \end{aligned}$ (The first term lies in $\mathcal{M}$, since $U$ is invariant under $P$, so it contributes 0 to the distance.) The second term is a linear combination of the remaining eigenvectors. The distance can then be written as $\begin{aligned} d_{\mathcal{M}}^{2}(P X) &=\sum_{m=M+1}^{N}\left\|\lambda_{m} w_{m}\right\|^{2} \\ & \leq \lambda^{2} \sum_{m=M+1}^{N}\left\|w_{m}\right\|^{2} \\ &=\lambda^{2} d_{\mathcal{M}}^{2}(X) \end{aligned}$ where $\lambda$ is the supremum of $|\lambda_m|$ over $m > M$. Lemma 2 $d_{\mathcal{M}}\left(XW_{lh}\right) \leq s_{lh} d_{\mathcal{M}}(X)$ The proof is the same as for the first lemma; whether the matrix ($W$ or $P$) is multiplied from the right or from the left does not matter much, since $P$ is a symmetric matrix.
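Lemma 1 and Lemma 2 can be checked numerically on a small example. The sketch below is an illustration under stated assumptions, not the paper's code: $P$ is taken to be the augmented normalized adjacency $\tilde{D}^{-\frac{1}{2}}(A+I)\tilde{D}^{-\frac{1}{2}}$ of a hypothetical connected toy graph, $U$ is its eigenspace for the top eigenvalue 1, and $W$ is a random weight matrix.

```python
import numpy as np

def distance_to_subspace(X, E):
    """Frobenius distance from X to {E W}, E having orthonormal columns."""
    return np.linalg.norm(X - E @ (E.T @ X))

# Hypothetical connected toy graph: path 0-1-2-3-4 plus the chord 1-3.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 3)]
A = np.zeros((5, 5))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

A_tilde = A + np.eye(5)
inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
P = inv_sqrt[:, None] * A_tilde * inv_sqrt[None, :]   # symmetric propagation matrix

eigvals, eigvecs = np.linalg.eigh(P)       # eigenvalues in ascending order
E = eigvecs[:, -1:]                        # basis of U: eigenvector of the top eigenvalue
lam = np.max(np.abs(eigvals[:-1]))         # operator norm of P restricted to U-perp

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
W = rng.normal(size=(3, 4))
s = np.linalg.norm(W, 2)                   # largest singular value of W

d_X = distance_to_subspace(X, E)
d_PX = distance_to_subspace(P @ X, E)      # Lemma 1: d_PX <= lam * d_X
d_XW = distance_to_subspace(X @ W, E)      # Lemma 2: d_XW <= s * d_X
```

On this graph the top eigenvalue is 1 and `lam` is strictly smaller than 1, so every propagation step contracts the distance to $\mathcal{M}$ unless the weights compensate with a large singular value `s`.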
Lemma 3 $d_{\mathcal{M}}\left(\sigma(X)\right) \leq d_{\mathcal{M}}(X)$ The proof of lemma three is different than the first two, for the activation function is element-wise, not vector-wise. First, we need to change the expression of the basis from both the node number $n$ and the $d$ dimension size. Let $(e_c’)_{c\in [C]}$ be the standard basis of $\mathbb{R}^C$. The norm is more like a matrix form: <code class="language-plaintext highlighter-rouge">$(e_n \otimes e_c')_{c\in [C], n\in [N]}$</code>. any $X$ can be decoupled into: \[X = \sum_{n=1}^N \sum_{c=1}^C a_{nc}e_n \otimes e_c'\] Then $\begin{aligned} d_{\mathcal{M}}^{2}(X) &=\sum_{n=M+1}^{N}\left\|\sum_{c=1}^{C} a_{n c} e_{c}^{\prime}\right\|^2 \\ &=\sum_{n=M+1}^{N} \sum_{c=1}^{C} a_{n c}^{2} \\ &=\sum_{c=1}^{C}\left(\sum_{n=1}^{N} a_{n c}^{2}-\sum_{n=1}^{M} a_{n c}^{2}\right) \\ &=\sum_{c=1}^{C}\left(\left\|X_{\cdot}\right\|^{2}-\sum_{n=1}^{M}\left\langle X_{\cdot c}, e_{n}\right\rangle^{2}\right) \end{aligned}$ The distance can be written as $d_{\mathcal{M}}^{2}(\sigma(X) = \sum_{c=1}^{C}\left(\left|X_{\cdot c}^+\right|^{2}-\sum_{n=1}^{M}\left\langle X_{\cdot c}^+, e_{n}\right\rangle^{2}\right)$ TODO: add the proof this part. Then for GNN with $\mathcal{M}$ as the eigenvector to the largest eigenvalues, if will falls exponentially into the eigenspace when $s\lambda < 1$\ <h3 id="spatial-understanding-and-solution-inspired-by-pagerank">Spatial: understanding and solution inspired by PageRank</h3> <h4 id="inspiration-from-ml-method">Inspiration from ML method</h4> Actually, the former theorem also can somehow be viewed as the extension of the pagerank (markov) problem. The fundmental theory is: any Markov process on finite states converges to a unique distribution (equilibrium) (stationary distribution) if it is irreducible and aperiodic. <ul> <li> A markov chain can be describe with the initial distribution $\pi_0$ corresponds to the state space $S$. 
Each step transitions from the current state according to the probability transition matrix $P\in \mathbb{R}^{n\times n}$. </li> <li> A stationary distribution is one that no longer changes: $\tilde{\pi} = \tilde{\pi} P$ </li> <li> A Markov chain can have 0, 1, or $\infty$ stationary distributions. To guarantee a unique one, the following properties should be satisfied: <ul> <li>irreducible: for any pair of states $i, j$ there is some step $t$ with $P(X_t=i|X_0=j) > 0$, so all nodes can be reached</li> <li>aperiodic: no deterministic cycling, that is, returns to a particular node are not restricted to multiples of some fixed period $t > 1$</li> <li>positive recurrent: the expected time to return to any node is finite, no matter which node the process starts from.</li> </ul> </li> </ul>
PageRank with random walks is an algorithm with the Markov property on graphs; its different versions will be detailed in another blog post.
<h4 id="jknet-representation-learning-on-graphs-with-jumping-knowledge-networks">JKNet: Representation Learning on Graphs with Jumping Knowledge Networks</h4>
<h5 id="analysis">Analysis</h5>
The range of “neighboring” nodes that a node’s representation draws from strongly depends on the graph structure, analogous to the spread of a random walk. The basic analysis tool is sensitivity analysis (the influence distribution), inspired by PageRank. The motivation is that the influence of different nodes is heavily affected by the graph structure. For example, with the same number of steps, nodes at different positions in the graph reach neighborhoods of very different sizes. This difference raises the question of whether a large or a small neighborhood is better. The answer is neither. <ul> <li>Too many neighbors bring higher-order features, where some of the information may be “washed out” via averaging</li> <li>Too few neighbors carry too little information</li> </ul> What we need is locality that adapts to each node.
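As a rough illustration of how graph structure shapes the spread of a random walk, the sketch below computes the $k$-step random-walk distribution from two start nodes on a small toy graph. The graph, the helper name `k_step_distribution`, the added self-loops, and $k = 2$ are all assumptions made for illustration; none of them come from the JKNet paper.

```python
def k_step_distribution(adj, start, k):
    """Distribution of a k-step uniform random walk started at `start`.

    `adj` maps each node to a list of its neighbors (undirected graph).
    A self-loop is added to every node, so the support after k steps
    is exactly the k-hop neighborhood of `start`.
    """
    dist = {node: 0.0 for node in adj}
    dist[start] = 1.0
    for _ in range(k):
        nxt = {node: 0.0 for node in adj}
        for node, mass in dist.items():
            if mass == 0.0:
                continue
            targets = adj[node] + [node]  # neighbors plus self-loop
            share = mass / len(targets)
            for t in targets:
                nxt[t] += share
        dist = nxt
    return dist


# Toy graph (assumed): a star centered at node 0 with a path 4-5-6-7 attached.
adj = {
    0: [1, 2, 3, 4],
    1: [0], 2: [0], 3: [0],
    4: [0, 5],
    5: [4, 6],
    6: [5, 7],
    7: [6],
}

k = 2
from_hub = k_step_distribution(adj, start=0, k=k)
from_tail = k_step_distribution(adj, start=7, k=k)

# Nodes that receive non-zero probability mass after k steps.
reach_hub = sorted(n for n, p in from_hub.items() if p > 0)
reach_tail = sorted(n for n, p in from_tail.items() if p > 0)
print(len(reach_hub), len(reach_tail))
```

With the same number of steps, the walk started at the hub covers twice as many nodes as the walk started at the end of the path, which is the kind of position-dependent neighborhood size the analysis refers to.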
To quantify how neighbors influence a node, the influence distribution is proposed, which gives insight into how large a neighborhood a node is drawing information from. The influence distribution is defined as follows. For a simple graph $G = (V, E)$, let <code class="language-plaintext highlighter-rouge">$h^{(0)}_x$</code> be the input feature and <code class="language-plaintext highlighter-rouge">$h^{(k)}_x$</code> be the learned hidden feature of node $x \in V$ at the $k$-th (last) layer of the model. The influence score $I(x, y)$ of node $x$ by any node $y \in V$ is the sum of the absolute values of the entries of the Jacobian matrix $\frac{\partial h_{x}^{(k)}}{\partial h_{y}^{(0)}}$. We define the influence distribution $I_x$ of $x \in V$ by normalizing the influence scores: $I_x(y)=I(x,y)/ \sum_z I(x, z)$, or $I_{x}(y)=e^{T}\left[\frac{\partial h_{x}^{(k)}}{\partial h_{y}^{(0)}}\right] e /\left(\sum_{z \in V} e^{T}\left[\frac{\partial h_{x}^{(k)}}{\partial h_{z}^{(0)}}\right] e\right)$ where $e$ is the all-ones vector.
<strong>The finding is:</strong> influence distributions of common aggregation schemes are closely connected to random walk distributions, which have a limiting (stationary) distribution when the graph is non-bipartite.
<strong>Theorem</strong> Given a $k$-layer GCN with averaging aggregation, assume that all paths in the computation graph of the model are activated with the same probability of success $\rho$. Then the influence distribution $I_x$ for any node $x \in V$ is equivalent, in expectation, to the $k$-step random walk distribution on $\tilde{G}$ starting at node $x$. The proof proceeds as follows: <ul> <li> The one-step derivative can be described with the non-linear activation mask, the degree, and the weight. 
Then $\begin{aligned} \frac{\partial h_{x}^{(k)}}{\partial h_{y}^{(0)}} &=\sum_{p=1}^{\Psi}\left[\frac{\partial h_{x}^{(k)}}{\partial h_{y}^{(0)}}\right]_{p} \\ &=\sum_{p=1}^{\Psi} \prod_{l=k}^{1} \frac{1}{\widetilde{\operatorname{deg}}\left(v_{p}^{l}\right)} \cdot \operatorname{diag}\left(1_{f_{v_{p}^{l}}^{(l)}>0}\right) \cdot W_{l} \end{aligned}$ where $\Psi$ is the total number of paths, and the product $\prod$ is computed over the nodes on each path. </li> <li> Then, for a single entry $(i, j)$ on path $p$, it can be rewritten as: $\left[\frac{\partial h_{x}^{(k)}}{\partial h_{y}^{(0)}}\right]_{p}^{(i, j)}=\prod_{l=k}^{1} \frac{1}{\tilde{\operatorname{deg}}\left(v_{p}^{l}\right)} \sum_{q=1}^{\Phi} Z_{q} \prod_{l=k}^{1} w_{q}^{(l)}$ where $Z_q$ indicates whether the path is activated or not. The simplification is made here: the activation is assumed to be a random event with a fixed probability, unrelated to the weights and the input. The non-linearity can then be removed by taking the expectation: $\mathbb{E}\left [ \frac{\partial h_{x}^{(k)}}{\partial h_{y}^{(0)}} \right ] = \rho \cdot \prod^1_{l=k}W_l \cdot \left(\sum_{p=1}^{\Psi} \prod_{l=k}^{1} \frac{1}{\widetilde{\operatorname{deg}}\left(v_{p}^{l}\right)}\right)$ The random-walk probability is exactly the last term. In other words, aggregation is just a form of random walk. </li> <li> The distribution changes slightly for the symmetric form of GCN, being normalized by $(\widetilde{\operatorname{deg}}(x)\widetilde{\operatorname{deg}}(y))^{-\frac{1}{2}}$ </li> </ul> What role does $W$ play here? With this result we can unify GCN with the random walk: both share the same stationary distribution.
<h5 id="method">Method</h5>
The goal is to adapt to different local neighborhood ranges in order to learn better structure-aware representations. The model is very simple: <ul> <li>concatenate the hidden representations from different layers, each covering a different neighborhood range</li> <li>readout: max-pooling, LSTM attention</li> </ul> The importance of the different ranges is determined after looking at all of them.
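A minimal sketch of the two simplest layer-aggregation options named above, concatenation and element-wise max-pooling, using plain Python lists. The function names `jk_concat` and `jk_max` and the feature values are hypothetical and only illustrate the idea; the LSTM-attention readout is not covered.

```python
def jk_concat(layer_reps):
    """Concatenate per-layer node features: L layers of [n][d] -> [n][L*d]."""
    n = len(layer_reps[0])
    return [[v for rep in layer_reps for v in rep[i]] for i in range(n)]


def jk_max(layer_reps):
    """Element-wise max over layers: L layers of [n][d] -> [n][d]."""
    n = len(layer_reps[0])
    d = len(layer_reps[0][0])
    return [
        [max(rep[i][j] for rep in layer_reps) for j in range(d)]
        for i in range(n)
    ]


# Hypothetical hidden features of 2 nodes (rows) taken from 3 layers.
layer_reps = [
    [[1.0, 0.0], [0.2, 0.4]],  # layer 1: small neighborhood
    [[0.5, 0.6], [0.9, 0.1]],  # layer 2
    [[0.3, 0.3], [0.3, 0.3]],  # layer 3: large, nearly smoothed neighborhood
]

concat_out = jk_concat(layer_reps)
max_out = jk_max(layer_reps)
print(concat_out[0])
print(max_out)
```

In this toy input, the max for node 0 comes from layer 1 in the first coordinate and from layer 2 in the second, so each node and feature can draw on a different neighborhood range.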
Max-pooling selects the most suitable layer, the one with the maximum influence.
<h4 id="appnp-predict-then-propagate-graph-neural-networks-meet-personalized-pagerank">APPNP: PREDICT THEN PROPAGATE: GRAPH NEURAL NETWORKS MEET PERSONALIZED PAGERANK</h4>
APPNP is inspired by JKNet, which gives a unifying view of GNNs and random walks (PageRank). The random walk ends in a stationary distribution (oversmoothing) regardless of the starting node. (Need to check what the stationary distribution is.) Thus, to keep the connection to the original node, it is natural to use personalized PageRank, which gives the walk a chance to return to the root node. This preserves locality and avoids oversmoothing, and it allows the network to use a much larger neighborhood range. Personalized PageRank takes the form: <code class="language-plaintext highlighter-rouge">$\boldsymbol{\pi}_{\mathrm{ppr}}\left(\boldsymbol{i}_{x}\right)=(1-\alpha) \hat{\tilde{\boldsymbol{A}}} \boldsymbol{\pi}_{\mathrm{ppr}}\left(\boldsymbol{i}_{x}\right)+\alpha \boldsymbol{i}_{x}$</code>. The solution is: $\pi_{ppr}(i_x) = \alpha(I_n - (1-\alpha)\hat{\tilde{A}})^{-1}i_x$ With the stationary distribution, the stationary hidden representation can be written as: $Z_{APPNP} = \text{softmax}\left(\alpha(I_n - (1-\alpha)\hat{\tilde{A}})^{-1}H\right)$ where $H = f_\theta(X)$. Naturally, the transformation is separated from the aggregation. This allows a much larger range without changing the neural network (the possible benefit is discussed in a later paper). However, the personalized PageRank matrix is quite dense, which makes it computationally expensive. An approximate version: topic-sensitive PageRank via power iteration. 
$\begin{aligned} \boldsymbol{Z}^{(0)} &=\boldsymbol{H}=f_{\theta}(\boldsymbol{X}) \\ \boldsymbol{Z}^{(k+1)} &=(1-\alpha) \hat{\tilde{A}} \boldsymbol{Z}^{(k)}+\alpha \boldsymbol{H} \\ \boldsymbol{Z}^{(K)} &=\operatorname{softmax}\left((1-\alpha) \hat{\tilde{A}} \boldsymbol{Z}^{(K-1)}+\alpha \boldsymbol{H}\right) \end{aligned}$
<h4 id="gprgnn-adaptive-universal-generalized-pagerank-graph-neural-network">GPRGNN: ADAPTIVE UNIVERSAL GENERALIZED PAGERANK GRAPH NEURAL NETWORK</h4>
While JKNet proposes a max-pooling function to select among different layers, GPRGNN uses learnable parameters in a generalized PageRank for the layer selection. How was the original GPR optimized? That optimization is not easy to understand.
<h5 id="what-advances-in-generalized-pagerank">What advances in Generalized pagerank</h5>
Generalized PageRank (GPR) was first proposed for graph clustering. It takes the form: <code class="language-plaintext highlighter-rouge">$\sum_{k=0}^{\infty} \gamma_{k} \tilde{\mathbf{A}}_{\mathrm{sym}}^{k} \mathbf{H}^{(0)}=\sum_{k=0}^{\infty} \gamma_{k} \mathbf{H}^{(k)}$</code>. Clustering of the graph is performed locally by thresholding the GPR score. Other PageRank variants can be viewed as specific choices of GPR weights. APPNP corresponds to the fixed choice $\gamma_k = \alpha(1-\alpha)^k$. The learnable $\gamma_{k}$ give the model the ability to learn long-range or short-range information adaptively. The final form is similar to APPNP: \[\begin{aligned} \boldsymbol{H}^{(0)} &=f_{\theta}(\boldsymbol{X}) \\ \boldsymbol{H}^{(k)} &=\hat{\tilde{A}} \boldsymbol{H}^{(k-1)} \\ \boldsymbol{Z} &= \sum_{k=0}^K\gamma_k\boldsymbol{H}^{(k)}\\ \boldsymbol{\hat{P}} &=\operatorname{softmax}\left(\boldsymbol{Z}\right) \end{aligned}\] The learned weights differ across graphs; heterophilous graphs require more information from farther neighborhoods.
<h5 id="theory-properties">Theory properties:</h5>
The filter of GPRGNN is: $g_{\gamma, K}(\lambda)=\sum_{k=0}^{K} \gamma_{k} \lambda^{k}$ Assume $\sum_k \gamma_k = 1$, where each $\gamma_k$ may be negative.
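To get a feel for this polynomial filter, the sketch below evaluates $g_{\gamma,K}(\lambda)$ at the two ends of the spectrum for two weight choices. It assumes that $\lambda$ is an eigenvalue of the normalized adjacency, lying in $[-1, 1]$, with $\lambda = 1$ the smoothest component. The alternating-sign weights and the values of $K$ and $\alpha$ are illustrative assumptions, and the weights are left unnormalized because only the relative response matters here.

```python
def gpr_filter(gammas, lam):
    """Evaluate g(lam) = sum_k gamma_k * lam**k for a list of GPR weights."""
    return sum(g * lam ** k for k, g in enumerate(gammas))


K = 10
alpha = 0.1

# APPNP corresponds to the fixed, all-positive choice gamma_k = alpha * (1 - alpha)^k.
appnp_gammas = [alpha * (1 - alpha) ** k for k in range(K + 1)]

# Hypothetical weights with alternating signs, gamma_k = (-0.5)^k.
alt_gammas = [(-0.5) ** k for k in range(K + 1)]

# Assumed spectrum ends: lam = 1 (smooth), lam = -1 (most oscillating).
low_end, high_end = 1.0, -1.0

appnp_response = (gpr_filter(appnp_gammas, low_end), gpr_filter(appnp_gammas, high_end))
alt_response = (gpr_filter(alt_gammas, low_end), gpr_filter(alt_gammas, high_end))

print("APPNP weights:       g(1) = %.3f, g(-1) = %.3f" % appnp_response)
print("alternating weights: g(1) = %.3f, g(-1) = %.3f" % alt_response)
```

Under these assumptions the all-positive weights respond most strongly at the smooth end, while the alternating-sign weights respond most strongly at the oscillating end.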
If all $\gamma_k > 0$, the filter is a low-frequency filter; negative $\gamma_k$ allow a high-frequency filter. Lemma 1 has been read. TODO: Lemma 2. Assume the graph $G$ is connected and the training set contains nodes from each of the classes. Also assume that $k'$ is large enough so that the over-smoothing effect occurs for $H^{(k)}, \forall k \ge k'$, which dominate the contribution to the final output $Z$. Then the gradient with respect to $\gamma_k$ and $\gamma_k$ itself are identical in sign for all $k \ge k'$. This means that when oversmoothing happens, $\gamma_k$ is driven toward 0.
<h3 id="spectral-understanding-and-solution-inspired-by-graph-signal">Spectral: understanding and solution inspired by graph signal</h3>
<h4 id="fundamental-knowledge-in-graph-signal-process">Fundamental knowledge in graph signal process</h4>
A vector $x\in \mathbb{R}^n$ defined on the vertices of the graph is a graph signal. The basic operations are: <ul> <li>variation $\Delta$: $\mathbb{R}^n\to \mathbb{R}$, $\Delta(x) = \sum_{(i,j)\in \mathcal{E}}(x(i)-x(j))^2 = x^TLx$, which measures the smoothness</li> <li>$\tilde{D}$-inner product: $(x,y)_{\tilde{D}} = \sum_{i\in \mathcal{V}}(d(i)+\gamma)x(i)y(i) = x^T\tilde{D}y$, which measures the importance of the signal (putting more weight on high-degree nodes)</li> </ul> The $i$-th basis signal is defined by: $\min{\Delta(u)} \text{ subject to } (u, u)_{\tilde{D}}=1, (u, u_j)_{\tilde{D}}=0, j \in \{1, \cdots, i-1\}$ The solutions satisfy the generalized eigenvalue problem $Lu = \lambda \tilde{D}u$. The generalized eigenvalues correspond to the frequencies of the graph signal.
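The identity $\Delta(x) = \sum_{(i,j)\in\mathcal{E}}(x(i)-x(j))^2 = x^TLx$ can be checked numerically. The following sketch does so on a 4-cycle with a constant signal and an alternating signal; the graph, the signals, and the helper names are assumptions chosen for illustration, with $L = D - A$ the unnormalized Laplacian.

```python
def variation_edges(edges, x):
    """Variation as a sum of squared differences over undirected edges."""
    return sum((x[i] - x[j]) ** 2 for i, j in edges)


def laplacian(n, edges):
    """Dense unnormalized graph Laplacian L = D - A."""
    lap = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        lap[i][i] += 1.0
        lap[j][j] += 1.0
        lap[i][j] -= 1.0
        lap[j][i] -= 1.0
    return lap


def quadratic_form(mat, x):
    """Compute x^T mat x."""
    n = len(x)
    return sum(x[i] * mat[i][j] * x[j] for i in range(n) for j in range(n))


edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # a 4-cycle
lap = laplacian(4, edges)

smooth = [1.0, 1.0, 1.0, 1.0]    # constant signal
rough = [1.0, -1.0, 1.0, -1.0]   # sign flips across every edge

for name, x in [("smooth", smooth), ("rough", rough)]:
    print(name, variation_edges(edges, x), quadratic_form(lap, x))
```

The constant signal has zero variation, while the alternating signal has the largest variation among signals of the same norm on this graph, matching the reading of $\Delta$ as a smoothness measure.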
The Fourier basis is defined as: <ul> <li>Fourier transform $Fx = \hat{x} = U^T\tilde{D}x$</li> <li>inverse Fourier transform $F^{-1}\hat{x} = U\hat{x}$</li> <li>Graph filter: $\hat{y}(\lambda) = h(\lambda)\hat{x}(\lambda)$, which is equal to $y=h(\tilde{L}_{rw})x$, where $h(\tilde{L}_{rw})$ is defined through the Taylor expansion of $h$</li> <li>The noise-to-signal ratio is defined as $\frac{||Z||_D}{||X||_D}$</li> </ul>
<h4 id="revisiting-graph-neural-networks-all-we-have-is-low-pass-filters">Revisiting Graph Neural Networks: All We Have is Low-Pass Filters</h4>
With this background, we revisit GNNs from the graph signal processing perspective. The answer is that, with informative features, a GNN only performs low-pass filtering for denoising, and does not need any non-linear property. The motivation of this paper is: why and when do graph neural networks work well for vertex classification? <ul> <li>Is there a condition under which a GNN can work even without training?</li> <li>Is there a condition under which a GNN cannot work well?</li> </ul>
<h5 id="assumption-1-input-features-consist-of-low-frequency-true-features-and-noise-the-true-features-have-sufficient-information-for-the-machine-learning-task">Assumption 1: Input features consist of low-frequency true features and noise. The true features have sufficient information for the machine learning task.</h5>
The experiment verifies this with an MLP under different noise levels and different frequency ranges. <ul> <li>compute the Fourier basis $U$</li> <li>add Gaussian noise to the input features</li> <li>compute the first $k$ frequency components $\hat{X}_k=U[: k]^T\tilde{D}^{\frac{1}{2}}X$</li> <li> reconstruct the features $\tilde{X}_k=\tilde{D}^{-\frac{1}{2}}U[: k]\hat{X}_k$ </li> <li>train an MLP on the reconstructed features with different numbers of frequency components</li> </ul> TODO: add more explanation of the experiments. A GNN can provide low-frequency, smooth data.
<h5 id="theory-bias-variance-understanding-on-gnn">Theory: bias-variance understanding on GNN</h5>
TODO: some review on complexity and generalization. With the assumption that the feature is composed of the true feature $\bar{x}$ and noise $z(i)$, we have Lemma 5: Suppose Assumption 4. For any $0 < \delta < 1/2$, with probability at least $1 - \delta$, we have $\left\|\bar{X}-\tilde{A}_{r w}^{k} X\right\|_{D} \leq \sqrt{k \epsilon}\|\bar{X}\|_{D}+O(\sqrt{\log (1 / \delta) R(2 k)}) \mathbb{E}\left[\|Z\|_{D}\right]$ where $R(2k)$ is the probability that a random walk with a random initial vertex returns to the initial vertex after $2k$ steps. The first term is the bias induced by the filter and the second term is the variance from the original noise. The bias increases with the number of hops $k$ at a rate of $O(\sqrt{k\epsilon})$, whereas the variance decreases like $O(1/deg^{k/2})$. The optimal $k$ is then characterized as follows. Suppose that <code class="language-plaintext highlighter-rouge">$\mathbb{E}[||Z||_D] \le \rho ||\bar{X}||_D$</code> for some $\rho = O(1)$. Let <code class="language-plaintext highlighter-rouge">$k^*$</code> be defined by <code class="language-plaintext highlighter-rouge">$k^*$</code> = $O(\log(\log(1/\delta)\rho/\epsilon))$, and suppose that there exist constants $C_d$ and $\bar{d} > 1$ such that $R(2k) \le C_d/ \bar{d}^k$ for <code class="language-plaintext highlighter-rouge">$k \le k^*$</code>. Then, by choosing $k = k^*$, the right-hand side of the bound above is <code class="language-plaintext highlighter-rouge">$\tilde{O}( \sqrt{\epsilon})$</code>. 
TODO: find the proof. Understanding of GNNs: <ul> <li>GCN may overfit through its intermediate representations</li> <li>SGC is similar to an MLP applied to the true features</li> </ul>
<h4 id="scattering-gcn-overcoming-oversmoothness-in-graph-convolutional-networks">Scattering GCN: Overcoming Oversmoothness in Graph Convolutional Networks</h4>
Similar to the work on heterophily, it uses band-pass filtering of graph signals, because the low-pass signal only considers local activation patterns. The additional neural pathways encode higher-order forms of regularity in graphs, carried by higher-frequency signals.
<h5 id="geometric-scattering">Geometric scattering</h5>
Geometric scattering is defined by the lazy random walk matrix: $P=\frac{1}{2}(I_n+WD^{-1})$ where $x_t = P^t x$ is the low-frequency component in the geometric GNN. The wavelets are then defined as: $\left\{\begin{array}{l} \boldsymbol{\Psi}_{0}:=\boldsymbol{I}_{n}-\boldsymbol{P} \\ \boldsymbol{\Psi}_{k}:=\boldsymbol{P}^{2^{k-1}}-\boldsymbol{P}^{2^{k}}=\boldsymbol{P}^{2^{k-1}}\left(\boldsymbol{I}_{n}-\boldsymbol{P}^{2^{k-1}}\right), \quad k \geq 1 \end{array}\right.$ The geometric scattering transform is defined as: $U_px=\boldsymbol{\Psi}_{k_m}|\boldsymbol{\Psi}_{k_{m-1}} \cdots |\boldsymbol{\Psi}_{k_2}|\boldsymbol{\Psi}_{k_1}x|| \cdots |$ which is a stack of wavelets with element-wise absolute-value non-linearities. All features are then combined together, using a residual connection with a cutoff frequency. The theory part only uses some specific graphs, such as cyclic or bipartite graphs, on which GCN fails while the scattering channels provide better expressivity.
<h4 id="s2gc-simple-spectral-graph-convolution">S2GC: SIMPLE SPECTRAL GRAPH CONVOLUTION</h4>
This paper tries to extract the higher frequencies (via a self-loop and selective layers) with a modified Markov diffusion kernel, which enlarges the receptive field of the GNN. It is similar to APPNP, with another explanation and solution.
<h5 id="markov-diffusion-kernel">Markov diffusion kernel</h5>
It is similar to the shortest-path kernel introduced before, and focuses on co-occurrence along a Markov chain. 
$d_{i j}(K)=\left\|\mathbf{Z}(K)\left(\mathbf{x}_{i}(0)-\mathbf{x}_{j}(0)\right)\right\|_{2}^{2}$ where $Z(K)=\frac{1}{K}\sum_{k=1}^KT^k$ and $T$ is the transition matrix (normalized adjacency) $T=A' = (D + I)^{-1/2} ( A + I ) (D + I)^{-1/2}$
<h5 id="method-1">Method</h5>
It can be simply reduced to the form $\hat{Y}=\text{softmax}(\frac{1}{K}\sum_{k=0}^K\tilde{T}^kXW)$ with the Laplacian regularization: $\min{ h^TLh +\frac{1}{2}||h_i - x_i||_2^2} = \min{\frac{1}{2}\left(\sum_{i, j=1}^{n} \widetilde{\mathbf{A}}_{i j}\left\|\frac{\mathbf{h}_{i}}{\sqrt{d_{i}}}-\frac{\mathbf{h}_{j}}{\sqrt{d_{j}}}\right\|_{2}^{2}\right)+\frac{1}{2}\left(\sum_{i=1}^{n}\left\|\mathbf{h}_{i}-\mathbf{x}_{i}\right\|_{2}^{2}\right)}$ Then the self-loop is added: $\hat{Y}=\operatorname{softmax}\left(\frac{1}{K} \sum_{k=1}^{K}\left((1-\alpha) \widetilde{\mathbf{T}}^{k} \mathbf{X}+\alpha \mathbf{X}\right) \mathbf{W}\right)$
<h5 id="theory-analysis">Theory analysis</h5>
Theorem 1: $N(\tilde{T}^0)\subseteq N(\tilde{T}^1)\subseteq \cdots \subseteq N(\tilde{T}^K)$, that is, smaller neighborhoods are contained in larger ones. Theorem 2: the energy of the infinite-dimensional receptive field (largest $k$) will not dominate the total energy of the filter (a different perspective from oversquashing). TODO: reading.
<h3 id="understand-gnn-as-a-recursive-procedure">Understand GNN as a recursive procedure</h3>
A GNN stacks different orders of neighborhood sequentially: it first aggregates the first-order neighbors, then the second-order ones, and so on in a recursive way. The focus is then on the drawbacks of this procedure, and on the best way to learn from multi-hop neighborhoods.
<h4 id="oversquash-on-the-bottleneck-of-graph-neural-networks-and-its-practical-implications">Oversquash: ON THE BOTTLENECK OF GRAPH NEURAL NETWORKS AND ITS PRACTICAL IMPLICATIONS</h4>
Oversquashing is also a problem in RNNs, where an ever-growing amount of information is compressed into a fixed-size representation space. 
In a GNN the number of propagated messages grows exponentially, so messages from distant nodes may fail to arrive and long-range dependencies cannot be built. Experiments show that GNNs consistently underfit when fitting tree-structured graphs. Empirically, a GNN will overfit short-range signals rather than the long-range information squashed in the bottleneck. The solution is to add a direct interaction between pairs of nodes. An easy way to do so is to build a fully-connected GNN layer. (This is also the reason why Graphormer can help.) Other ablations find that a larger hidden dimension does not bring significant improvement, and that even a half fully-connected layer helps a lot. Direct interaction between all nodes in every layer, ignoring the graph structure, is not necessary.
<h4 id="adagcn-adaboosting-graph-convolutional-networks-into-deep-models">ADAGCN: ADABOOSTING GRAPH CONVOLUTIONAL NETWORKS INTO DEEP MODELS</h4>
Given this sequential behavior, AdaBoost is a good fit for the sequential relationship between different neighborhood orders. The idea is to use an RNN-like GCN with iterative updating of the node weights. AdaGCN follows this direction by choosing an appropriate $f$ in each layer rather than directly deepening the GCN layers. The base classifier is designed as: $Z^l = f_\theta(\hat{A}^lX)$ with only a linear transformation. AdaBoost is defined as: $\begin{aligned} \operatorname{err}^{(l)} &=\sum_{i=1}^{n} w_{i} \mathbb{I}\left(c_{i} \neq f_{\theta}^{(l)}\left(x_{i}\right)\right) / \sum_{i=1}^{n} w_{i} \\ \alpha^{(l)} &=\log \frac{1-\operatorname{err}^{(l)}}{\operatorname{err}^{(l)}}+\log (K-1) \end{aligned}$ What does $K$ mean? (In the SAMME variant of AdaBoost, $K$ is the number of classes.) $w_{i} \leftarrow w_{i} \cdot \exp \left(\alpha^{(l)} \cdot \mathbb{I}\left(c_{i} \neq f_{\theta}^{(l)}\left(x_{i}\right)\right)\right), i=1, \ldots, n$ The difference is that the form of $f_\theta$ is shared across layers, but each layer has its own parameters.
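The error, classifier weight, and node-weight update above can be traced on a tiny example. The sketch below is a generic SAMME-style round in plain Python, not the AdaGCN implementation. The labels and predictions are made up, $K$ is taken to be the number of classes, and the final renormalization of the weights is an added assumption that does not appear in the formulas above.

```python
import math


def samme_round(weights, labels, preds, num_classes):
    """One boosting round: returns (err, alpha, new_weights).

    Assumes 0 < err < 1, otherwise the logarithm is undefined.
    """
    miss = [1.0 if c != p else 0.0 for c, p in zip(labels, preds)]
    err = sum(w * m for w, m in zip(weights, miss)) / sum(weights)
    alpha = math.log((1.0 - err) / err) + math.log(num_classes - 1)
    # Up-weight only the misclassified nodes.
    new_w = [w * math.exp(alpha * m) for w, m in zip(weights, miss)]
    total = sum(new_w)  # renormalization (assumption, see lead-in)
    return err, alpha, [w / total for w in new_w]


labels = [0, 1, 2, 1]         # hypothetical true classes of 4 nodes
preds = [0, 1, 1, 1]          # base classifier output; node 2 is wrong
weights = [0.25, 0.25, 0.25, 0.25]

err, alpha, new_weights = samme_round(weights, labels, preds, num_classes=3)
print(err, alpha, new_weights)
```

After one round, the misclassified node carries most of the weight, so the next base classifier, which sees a larger neighborhood through $\hat{A}^{l+1}X$, concentrates on it.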
<h3 id="advanced-operation">Advanced operation</h3>
<h4 id="architectural-modifications-additional-connection-on-model-architecture">architectural modifications: additional connection on model architecture</h4>
Inspired by the residual connection in CV, GNNs have also been given specially designed residual connections. JKNet and S2GC can be viewed as dense connections in GNNs. Residual connections in GNNs, however, only prevent fast performance degradation and do not enhance performance. A new perspective is needed.
<h5 id="gcnii-simple-and-deep-graph-convolutional-networks">GCNII: Simple and Deep Graph Convolutional Networks</h5>
It proposes two simple yet effective techniques: initial residual and identity mapping. The final form: $\mathbf{H}^{(\ell+1)}=\sigma\left(\left(\left(1-\alpha_{\ell}\right) \tilde{\mathbf{P}} \mathbf{H}^{(\ell)}+\alpha_{\ell} \mathbf{H}^{(0)}\right)\left(\left(1-\beta_{\ell}\right) \mathbf{I}_{n}+\beta_{\ell} \mathbf{W}^{(\ell)}\right)\right)$ The initial residual is somewhat similar to feature similarity preservation. A question: what if the embedding size is large? Use an MLP layer first. This differs from APPNP and its linear combination: GCNII is genuinely deep, with non-linear transformations. Identity mapping is similar to a residual connection, but it also interacts with the non-linearity and the initial residual. It is hard to see much difference. <ul> <li>It pushes the maximum singular value $s$ toward 1, which weakens the contraction implied by $s\lambda < 1$</li> <li>It keeps the norm of $W$ small, putting strong regularization on $W^l$ to avoid overfitting</li> </ul> Theory part. Theorem 1: Assume the self-looped graph $\tilde{G}$ is connected. Let $h^{(K)} = \left( \frac{I_n+\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}}{2}\right)^K \cdot x$ denote the representation obtained by applying a $K$-layer renormalized graph convolution with residual connection to a graph signal $x$. 
Let $\lambda_{\tilde{G}}$ denote the spectral gap of the self-looped graph $\tilde{G}$, that is, the least nonzero eigenvalue of the normalized Laplacian $\tilde{L} = I_n - \tilde{D}^{-1/2}\tilde{A} \tilde{D}^{-1/2}$. We have 1) As $K$ goes to infinity, $h^{(K)}$ converges to $\pi = \frac{\langle \tilde{D}^{1/2}\mathbf{1}, x\rangle}{2m+n}\cdot \tilde{D}^{1/2}\mathbf{1}$, where $\mathbf{1}$ denotes an all-one vector. 2) The convergence rate is determined by $\mathbf{h}^{(K)}=\pi \pm\left(\sum_{i=1}^{n} x_{i}\right) \cdot\left(1-\frac{\lambda_{\tilde{G}}^{2}}{2}\right)^{K} \cdot \mathbf{1}$ The keys of the proof are: <ul> <li> split the original $h^{(K)}$ into a linear combination of basis vectors (which can represent the random walk) $\tilde{\mathbf{D}}^{1 / 2} \mathbf{x}=\left(\mathbf{D}+\mathbf{I}_{n}\right)^{1 / 2} \mathbf{x}=\sum_{i=1}^{n}\left(\mathbf{x}(i) \sqrt{d_{i}+1}\right) \cdot \mathbf{e}_{\mathbf{i}}$ </li> <li> use the lemma: Let $p^{(K)}_i = \left(\frac{I_n+\tilde{A} \tilde{D}^{-1}}{2}\right)^Ke_i$ be the $K$-th transition probability vector from node $i$ on the connected self-looped graph $\tilde{G}$. Let $\lambda_{\tilde{G}}$ denote the spectral gap of $\tilde{G}$. The $j$-th entry of $p^{(K)}_i$ can be bounded by $\left|\mathbf{p}_{i}^{(K)}(j)-\frac{d_{j}+1}{2 m+n}\right| \leq \sqrt{\frac{d_{j}+1}{d_{i}+1}}\left(1-\frac{\lambda_{\tilde{G}}^{2}}{2}\right)^{K} .$ </li> </ul> Theorem 2: Consider the self-looped graph $\tilde{G}$ and a graph signal $x$. A $K$-layer GCNII can express a $K$-order polynomial filter $\left(\sum_{l=0}^K\theta_l\tilde{L}^l\right)x$ with arbitrary coefficients $\theta$. Too many assumptions, not so good.
<h4 id="regularization--normalization-trick">regularization & normalization trick</h4>
The two most popular regularization methods are batch norm and dropout. However, dropout does not work well on graph architectures, and batch norm cannot capture the relations between nodes. Various methods have been proposed to enhance generalization, reduce overfitting, and reduce oversmoothing.
<h5 id="dropedge-towards-deep-graph-convolutional-networks-on-node-classification">DROPEDGE: TOWARDS DEEP GRAPH CONVOLUTIONAL NETWORKS ON NODE CLASSIFICATION</h5>
DropEdge is a natural extension of dropout. There are two ways to view it: <ul> <li>A data augmenter: similar to dropout</li> <li>A message reducer: it removes some neighbors <ul> <li>it slows down the convergence speed; the relaxed $\epsilon$-smoothing layer can only increase.</li> <li>it leaves a smaller gap between the original features and the convergence subspace</li> </ul> </li> </ul> DropEdge has two steps: <ul> <li>$A_{drop} = A - A'$</li> <li>Renormalization $A_{drop} \to \hat{A}_{drop}$</li> </ul> <strong>Proof 1</strong> (the proof is not very convincing) \[\hat{l}(\mathcal{M}, \epsilon) \le \hat{l}(\mathcal{M}', \epsilon)\] The $\epsilon$-smoothing layer is defined as <code class="language-plaintext highlighter-rouge">$l^{*}(\mathcal{M}, \epsilon):=\min _{l}\left\{d_{\mathcal{M}}\left(\boldsymbol{H}^{(l)}\right)<\epsilon\right\}$</code> The relaxed $\epsilon$-smoothing layer is an upper bound of the $\epsilon$-smoothing layer: $\hat{l}(\mathcal{M}, \epsilon):= \left\lceil\frac{\log \left(\epsilon / d_{\mathcal{M}}(\boldsymbol{X})\right)}{\log s \lambda}\right\rceil$ where $s$ is the largest singular value of the weight matrices and $\lambda$ is the second largest eigenvalue of $\hat{A}$. We also need to adopt some concepts from Lovász et al. (1993) in proving Theorem 1. Consider the graph $G$ as an electrical network, where each edge represents a unit resistance. Then the effective resistance $R_{st}$ from node $s$ to node $t$ is defined as the total resistance between node $s$ and $t$. According to Corollary 3.3 and Theorem 4.1 (i) in Lovász et al. (1993), we can build the connection between $\lambda$ and $R_{st}$ for each connected component via the commute time. Removing any edge increases $R_{st}$; if the removal splits the graph into two parts, more dimensions of information are retained.
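The two DropEdge steps, removing a random subset of edges and then renormalizing the adjacency, can be sketched in plain Python as follows. The helper names, the toy graph, and the use of the GCN-style renormalization $\tilde{D}^{-1/2}(A+I)\tilde{D}^{-1/2}$ are assumptions for illustration; in training, a fresh subset would be drawn at every epoch.

```python
import random


def drop_edge(edges, drop_rate, rng):
    """Keep a random subset of the undirected edges (step 1: A_drop = A - A')."""
    num_keep = len(edges) - int(len(edges) * drop_rate)
    return rng.sample(edges, num_keep)


def renormalized_adjacency(n, edges):
    """Step 2: dense D^{-1/2} (A + I) D^{-1/2}, with D the degrees of A + I."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        a[i][j] = 1.0
        a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [
        [a[i][j] / (deg[i] * deg[j]) ** 0.5 for j in range(n)]
        for i in range(n)
    ]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]

kept = drop_edge(edges, drop_rate=0.4, rng=rng)
a_hat_drop = renormalized_adjacency(5, kept)

print(kept)
for row in a_hat_drop:
    print(["%.2f" % v for v in row])
```

Because the degrees are recomputed after dropping, the self-loop entries of nodes that lost edges grow, which is one way to see why message passing is slowed down.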
<h5 id="grand-graph-random-neural-networks-for-semi-supervised-learning-on-graph">GRAND: Graph Random Neural Networks for Semi-Supervised Learning on Graph</h5>
Building on the idea of DropEdge, more dropout-based tricks are proposed, together with a contrastive-style loss. GRAND has similar advantages; a new one is enhanced robustness. GRAND has three components: <ul> <li> Each node's features can be randomly dropped either partially (dropout) or entirely (dropnode) </li> <li> decouple propagation and transformation: $\bar{A}=\sum^K_{k=0}\frac{1}{K+1}\hat{A}^k$, followed by an MLP </li> <li> An additional consistency regularization across different views. $\mathcal{L}_{\text {con }}=\frac{1}{S} \sum_{s=1}^{S} \sum_{i=0}^{n-1}\left\|\overline{\mathbf{Z}}_{i}^{\prime}-\widetilde{\mathbf{Z}}_{i}^{(s)}\right\|_{2}^{2}$ where $\bar{Z}_i'$ is the mean over the different views. </li> </ul> The theorem is somewhat trivial; it explains how these losses act as a regularization term.
<h5 id="pairnorm-tackling-oversmoothing-in-gnns">PAIRNORM: TACKLING OVERSMOOTHING IN GNNS</h5>
From the perspective of batch norm, a normalization for GNNs is proposed that prevents all node embeddings from becoming too similar. The idea is somewhat similar to feature similarity preservation. The analysis shows that most GNNs perform a special form of Laplacian smoothing, which makes node features more similar to one another. The key idea is to ensure that the total pairwise feature distance remains constant across layers, which in turn leads to distant pairs having less similar features, preventing feature mixing across clusters. 
Two measurements are proposed: $\begin{array}{l} \text { row-diff }\left(\mathbf{H}^{(k)}\right)=\frac{1}{n^{2}} \sum_{i, j \in[n]}\left\|\mathbf{h}_{i}^{(k)}-\mathbf{h}_{j}^{(k)}\right\|_{2} \\ \text { col-diff }\left(\mathbf{H}^{(k)}\right)=\frac{1}{d^{2}} \sum_{i, j \in[d]}\left\|\mathbf{h}_{\cdot i}^{(k)} / \|\mathbf{h}_{\cdot i}^{(k)}\|_{1}-\mathbf{h}_{\cdot j}^{(k)} / \|\mathbf{h}_{\cdot j}^{(k)}\|_{1}\right\|_{2} \end{array}$ where row-diff, the average of all pairwise distances between node features, quantifies node-wise oversmoothing, and col-diff quantifies feature-wise oversmoothing. It actually feels rather strange that col-diff would decrease. The reason for choosing these measurements deserves some thought. Why row-diff changes so sharply is still under discussion. A new interpretation: graph-regularized least squares $\min _{\overline{\mathbf{X}}} \sum_{i \in \mathcal{V}}\left\|\overline{\mathbf{x}}_{i}-\mathbf{x}_{i}\right\|_{\tilde{\mathbf{D}}}^{2}+\sum_{(i, j) \in \mathcal{E}}\left\|\overline{\mathbf{x}}_{i}-\overline{\mathbf{x}}_{j}\right\|_{2}^{2}$ where $\bar{X}\in \mathbb{R}^{N\times d}$, $\left\|z_i\right\|_{\tilde{\mathbf{D}}}^{2} = z_i^T\tilde{\mathbf{D}}z_i$, with the closed-form solution $\bar{X}=(2I - \tilde{A}_{rw})^{-1}X$. What really matters is preserving the corresponding representation space. We should not only smooth nodes within the same cluster, but also keep distant, disconnected pairs apart: $\min _{\overline{\mathbf{X}}} \sum_{i \in \mathcal{V}}\left\|\overline{\mathbf{x}}_{i}-\mathbf{x}_{i}\right\|_{\tilde{\mathbf{D}}}^{2}+\sum_{(i, j) \in \mathcal{E}}\left\|\overline{\mathbf{x}}_{i}-\overline{\mathbf{x}}_{j}\right\|_{2}^{2} - \lambda \sum_{(i, j) \notin \mathcal{E}}\left\|\overline{\mathbf{x}}_{i}-\overline{\mathbf{x}}_{j}\right\|_{2}^{2}$ The total distance should stay the same: $\sum_{(i, j) \in \mathcal{E}}\left\|\dot{\mathbf{x}}_{i}-\dot{\mathbf{x}}_{j}\right\|_{2}^{2}+\sum_{(i, j) \notin \mathcal{E}}\left\|\dot{\mathbf{x}}_{i}-\dot{\mathbf{x}}_{j}\right\|_{2}^{2}=\sum_{(i, j) \in \mathcal{E}}\left\|\mathbf{x}_{i}-\mathbf{x}_{j}\right\|_{2}^{2}+\sum_{(i, j) \notin \mathcal{E}}\left\|\mathbf{x}_{i}-\mathbf{x}_{j}\right\|_{2}^{2}$ To avoid a high computational cost, the total pairwise squared distance (TPSD) is computed as: $\operatorname{TPSD}(\tilde{\mathbf{X}})=\sum_{i, j \in[n]}\left\|\tilde{\mathbf{x}}_{i}-\tilde{\mathbf{x}}_{j}\right\|_{2}^{2}=2 n^{2}\left(\frac{1}{n} \sum_{i=1}^{n}\left\|\tilde{\mathbf{x}}_{i}\right\|_{2}^{2}-\left\|\frac{1}{n} \sum_{i=1}^{n} \tilde{\mathbf{x}}_{i}\right\|_{2}^{2}\right)$ A further simplification gives $\operatorname{TPSD}(\tilde{\mathbf{X}}) = \operatorname{TPSD}(\tilde{\mathbf{X}}^c) = 2n||\tilde{\mathbf{X}}^c||^2_F$ where $\tilde{\mathbf{X}}^c$ is the centered representation, obtained by subtracting the mean row from every row of $\tilde{\mathbf{X}}$. The final procedure is to center and rescale: $\begin{array}{l} \tilde{\mathbf{x}}_{i}^{c}=\tilde{\mathbf{x}}_{i}-\frac{1}{n} \sum_{i=1}^{n} \tilde{\mathbf{x}}_{i} \\ \dot{\mathbf{x}}_{i}=s \cdot \frac{\tilde{\mathbf{x}}_{i}^{c}}{\sqrt{\frac{1}{n} \sum_{i=1}^{n}\left\|\tilde{\mathbf{x}}_{i}^{c}\right\|_{2}^{2}}}=s \sqrt{n} \cdot \frac{\tilde{\mathbf{x}}_{i}^{c}}{\sqrt{\left\|\tilde{\mathbf{X}}^{c}\right\|_{F}^{2}}} \end{array}$
<h5 id="towards-deeper-graph-neural-networks-with-differentiable-group-normalization">Towards Deeper Graph Neural Networks with Differentiable Group Normalization</h5>
GroupNorm gives another interpretation of oversmoothing: the inter-class distance becomes smaller than the intra-class distance. Moreover, it is somewhat similar to DiffPool. The main omission of PairNorm is that it ignores nodes in the same group that are not connected. Nodes within the same community or class need to be similar to facilitate classification, while different classes are expected to be separated in the embedding space from a global point of view. 
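For comparison with the group-wise variant discussed here, PairNorm's center-and-rescale step can be sketched in a few lines of plain Python. The function names `pair_norm` and `tpsd`, the toy features, and the scale $s = 1$ are illustrative assumptions; the check relies on the identity $\operatorname{TPSD}(\tilde{\mathbf{X}}) = 2n\|\tilde{\mathbf{X}}^c\|_F^2$ stated earlier, which gives $2n^2s^2$ after rescaling.

```python
def pair_norm(x, s=1.0):
    """Center the rows of x, then rescale so the mean squared row norm is s**2."""
    n, d = len(x), len(x[0])
    mean = [sum(row[j] for row in x) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in x]
    mean_sq_norm = sum(v * v for row in centered for v in row) / n
    scale = s / mean_sq_norm ** 0.5
    return [[scale * v for v in row] for row in centered]


def tpsd(x):
    """Total pairwise squared distance over all ordered pairs (i, j)."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(xi, xj))
        for xi in x
        for xj in x
    )


# Hypothetical node features (3 nodes, 2 dimensions) at two different scales.
x_small = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
x_large = [[10.0, 20.0], [30.0, 0.0], [50.0, 40.0]]

out_small = pair_norm(x_small, s=1.0)
out_large = pair_norm(x_large, s=1.0)

print(tpsd(x_small), tpsd(x_large))      # very different before normalization
print(tpsd(out_small), tpsd(out_large))  # both close to 2 * n**2 * s**2 = 18
```

Whatever the scale of the incoming features, the output has the same total pairwise squared distance, which is the constant-distance constraint PairNorm is built around.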
The challenges are: <ul> <li>oversmoothing relates not only to local node relations (PairNorm) but also to global structures (groups)</li> <li>group information is hard to obtain during training.</li> </ul> Analysis measurements. The group ratio is defined as: $R_{\mathrm{Group}}=\frac{\frac{1}{(C-1)^{2}} \sum_{i \neq j}\left(\frac{1}{\left|\boldsymbol{L}_{i}\right|\left|\boldsymbol{L}_{j}\right|} \sum_{h_{i v} \in \boldsymbol{L}_{i}} \sum_{h_{j v^{\prime}} \in \boldsymbol{L}_{j}}\left\|h_{i v}-h_{j v^{\prime}}\right\|_{2}\right)}{\frac{1}{C} \sum_{i}\left(\frac{1}{\left|\boldsymbol{L}_{i}\right|^{2}} \sum_{h_{i v}, h_{i v^{\prime}} \in \boldsymbol{L}_{i}}\left\|h_{i v}-h_{i v^{\prime}}\right\|_{2}\right)}$ The instance information gain is defined as the mutual information: $G_{\text {Ins }}=I(\mathcal{X} ; \mathcal{H})=\sum_{x_{v} \in \mathcal{X}, h_{v} \in \mathcal{H}} P_{\mathcal{X H}}\left(x_{v}, h_{v}\right) \log \frac{P_{\mathcal{X} \mathcal{H}}\left(x_{v}, h_{v}\right)}{P_{\mathcal{X}}\left(x_{v}\right) P_{\mathcal{H}}\left(h_{v}\right)}$ The metrics proposed in this line of papers are ones that the authors' own methods handle well, rather than metrics designed to find any insight. GroupNorm: normalize the node embeddings group by group, so that the embeddings within each group are rescaled to be more similar. 
The three steps are: <ul> <li>assign each node to learnable groups $S^{(k)}= \text{softmax}\left (H^{(k)}U^{(k)} \right)$</li> <li>rescale within each group: $H_i^{(k)} = S^{(k)}[:, i] \circ H^{(k)}$, $\tilde{H}_i^{(k)}=\gamma_i(\frac{H_i^{(k)}-\mu_i}{\sigma_i})+\beta_i$</li> <li>strengthen the importance of the original information: $\tilde{H}^{(k)} = H^{(k)}+\lambda\sum_{i=1}^G\tilde{H}_i^{(k)}$</li> </ul>
<h4 id="benchmark-bag-of-tricks-for-training-deeper-graph-neural-networks-a-comprehensive-benchmark-study">Benchmark: Bag of Tricks for Training Deeper Graph Neural Networks: A Comprehensive Benchmark Study</h4>
The empirical study has the following findings: <ul> <li>As we empirically show, while initial connection and jumping connection are both “beneficial” training tricks when applied alone, combining them together deteriorates deep GNN performance.</li> <li>Although dense connection brings considerable improvement on large-scale graphs with deep GNNs, it sacrifices the training stability to a severe extent.</li> <li>As another example, the gain from NodeNorm diminishes when applied to large-scale datasets or deeper GNN backbones.</li> <li>Moreover, using random dropping techniques alone often yields unsatisfactory performance.</li> <li>Lastly, we observe that adopting initial connection and group normalization is universally effective across tens of classical graph datasets. 
Those findings urge more synergistic rethinking of those seminal works</li> </ul> Observations on skip connections: <ul> <li>Skip connections can accelerate training.</li> <li>Shallow models do not benefit from skip connections (except on large datasets, possibly because of noise).</li> <li>SGC benefits from skip connections.</li> </ul> Normalization <ul> <li>NodeNorm <code class="language-plaintext highlighter-rouge">$\left(\mathbf{x}_{i} ; p\right)=\frac{\mathbf{x}_{i}}{\operatorname{std}\left(\mathbf{x}_{i}\right)^{\frac{1}{p}}}$</code></li> <li>MeanNorm <code class="language-plaintext highlighter-rouge">$\left(\mathbf{x}_{(k)}\right)=\mathbf{x}_{(k)}-\mathbb{E}\left[\mathbf{x}_{(k)}\right] $</code></li> <li>BatchNorm <code class="language-plaintext highlighter-rouge">$\left(\mathbf{x}_{(k)}\right)=\gamma \cdot \frac{\mathbf{x}_{(k)}-\mathbb{E}\left[\mathbf{x}_{(k)}\right]}{\operatorname{std}\left(\mathbf{x}_{(k)}\right)}+\beta $</code></li> </ul> Observations on normalization: <ul> <li>Training with normalization is much more stable.</li> <li>NodeNorm and PairNorm perform well on small datasets, while group normalization performs well on larger datasets.</li> </ul> Observations on dropping: <ul> <li>Dropout is suited to shallow GNNs.</li> <li>Drop techniques are suitable for random dropping.</li> </ul> <h3 id="tradeoff-between-neighborhood-size-and-neural-network-depth">Tradeoff between Neighborhood size and neural network depth</h3> The difference between a GNN and a basic MLP is the aggregation step (the neighborhood size). In a traditional GNN, a larger neighborhood requires more layers and hence more parameters, which can lead to overfitting. Many papers therefore propose to decouple transformation from aggregation. So what is the key cause of oversmoothing? <h4 id="dagnn-towards-deeper-graph-neural-networks">DAGNN: Towards Deeper Graph Neural Networks</h4> The key factor compromising performance is the entanglement of representation transformation and propagation. Decoupling the two is the key component. 
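To make the idea of decoupling concrete, here is a minimal NumPy sketch. It assumes a two-layer MLP for the transformation and a symmetrically normalized adjacency matrix with self-loops for the propagation; the function names (<code class="language-plaintext highlighter-rouge">normalized_adjacency</code>, <code class="language-plaintext highlighter-rouge">decoupled_forward</code>) and the toy weights are hypothetical and not taken from any paper's code.

```python
import numpy as np

def normalized_adjacency(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2} (assumed normalization)."""
    A_tilde = A + np.eye(A.shape[0])           # add self-loops
    deg = A_tilde.sum(axis=1)                  # degrees including self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def decoupled_forward(A, X, W1, W2, k):
    """Transformation first (MLP), then k parameter-free propagation steps."""
    Z = np.maximum(X @ W1, 0.0) @ W2           # transformation: 2-layer MLP with ReLU
    A_hat = normalized_adjacency(A)
    for _ in range(k):                         # propagation: no weights involved
        Z = A_hat @ Z
    return Z

# Toy example: complete graph on 3 nodes, hypothetical constant weights.
A = np.ones((3, 3)) - np.eye(3)
X = np.arange(12, dtype=float).reshape(3, 4)
W1 = np.full((4, 5), 0.1)
W2 = np.full((5, 2), 0.1)
out0 = decoupled_forward(A, X, W1, W2, k=0)    # MLP output only
out1 = decoupled_forward(A, X, W1, W2, k=1)    # one propagation step
```

The number of learnable parameters here does not depend on <code class="language-plaintext highlighter-rouge">k</code>, so the neighborhood size can grow without adding transformation layers. On this fully connected toy graph a single propagation step already averages all rows, which illustrates how quickly propagation alone smooths the representations.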
<h5 id="analysis-quantitative-metric-for-smoothness">Analysis: Quantitative metric for smoothness</h5> \[D(x_i, x_j)=\frac{1}{2}||\frac{x_i}{||x_i||} - \frac{x_j}{||x_j||} ||\] where <code class="language-plaintext highlighter-rouge">$||\cdot||$</code> denotes the Euclidean norm. The smoothness score decays quickly in a well-trained GCN; in the disentangled model with only linear propagation, however, it does not drop as quickly. The disentangled architecture is: $\begin{aligned} Z &=\operatorname{MLP}(X) \\ X_{o u t} &=\operatorname{softmax}\left(\widehat{A}^{k} Z\right) \end{aligned}$ <h5 id="theoretically-analysis">Theoretical analysis</h5> Nothing new here: repeated multiplication by $D^{-1}A$ or $D^{-\frac{1}{2}}AD^{-\frac{1}{2}}$ converges to a fixed vector. <h5 id="application">Application</h5> Design of DAGNN: it uses an adaptive adjustment mechanism that balances the information from local and global neighborhoods for each node: $\begin{array}{ll} Z=\operatorname{MLP}(X) & \in \mathbb{R}^{n \times c} \\ H_{\ell}=\widehat{A}^{\ell} Z, \ell=1,2, \cdots, k & \in \mathbb{R}^{n \times c} \\ H=\operatorname{stack}\left(Z, H_{1}, \cdots, H_{k}\right) & \in \mathbb{R}^{n \times(k+1) \times c} \\ S=\sigma(H s) & \in \mathbb{R}^{n \times(k+1) \times 1} \\ \widetilde{S}=\operatorname{reshape}(S) & \in \mathbb{R}^{n \times 1 \times(k+1)} \\ X_{\text {out }}=\operatorname{softmax}(\text { squeeze }(\widetilde{S} H)) & \in \mathbb{R}^{n \times c} \end{array}$ where $s\in \mathbb{R}^{c \times 1}$ is a trainable projection vector and $c$ is the number of classes. <h4 id="revisiting-oversmoothing-in-deep-gcns">Revisiting Oversmoothing in Deep GCNs</h4> However, another point of view has been proposed (its solution is related to the spectral perspective): the transformation layers learn to counteract oversmoothing during training. The understanding is that an untrained GCN does indeed oversmooth, but the learning procedure learns to work against it. However, the model is not trained well enough to fully overcome oversmoothing. 
This paper proposes the following understanding: <ul> <li> The forward pass optimizes the smoothness. The paper gives a learning-rate analysis: $\begin{aligned} \nabla_{X} &=\frac{\partial R(X)}{\partial X}=\frac{1}{2} \frac{\partial \frac{\operatorname{Tr}\left(X^{\top} \Delta X\right)}{\operatorname{Tr}\left(X^{\top} X\right)}}{\partial X}=\frac{\left(\Delta-I \frac{\operatorname{Tr}\left(X^{\top} \Delta X\right)}{\operatorname{Tr}\left(X^{\top} X\right)}\right) X}{\operatorname{Tr}\left(X^{\top} X\right)} \\ X_{m i d} &=X-\eta \nabla_{X}=\frac{(2I-\Delta) X}{2-\frac{\operatorname{Tr}\left(X^{\top} \Delta X\right)}{\operatorname{Tr}\left(X^{\top} X\right)}}=\frac{\left(I+D^{-\frac{1}{2}} A D^{-\frac{1}{2}}\right)}{2-\frac{\operatorname{Tr}\left(X^{\top} \Delta X\right)}{\operatorname{Tr}\left(X^{\top} X\right)}} X \end{aligned}$ </li> <li> The backward pass optimizes the classification loss, which reduces the oversmoothing. </li> </ul> TODO: read the proof that connected nodes share similar representations (“similar” means the scale of each feature channel is approximately proportional to the square root of its degree). Why is there no higher-order degree term here? Solution: mean subtraction (mean-subtract), which approaches the component associated with the second-largest eigenvalue. <h4 id="evaluating-deep-graph-neural-networks">Evaluating Deep Graph Neural Networks</h4> The most interesting point is that this paper may contradict the perspective of the previous one: even a plain MLP runs into the so-called oversmoothing problem. 
This paper studies: <ul> <li>What is the root cause of the performance decay in deeper GNNs (is it oversmoothing?)</li> <li>When and how should we build deeper GNNs?</li> </ul> <h5 id="experiment-setting">Experiment setting</h5> <ul> <li> Smoothness measurement. The stationary state is $\hat{\mathrm{A}}_{i, j}^{\infty}=\frac{\left(d_{i}+1\right)^{r}\left(d_{j}+1\right)^{1-r}}{2 M+N}$. Node smoothness is measured through the similarity to the initial state $\mathbf{x}_{v}^{0}$ and to the stationary state $\mathbf{x}_{v}^{\infty}$: $\begin{array}{c} \alpha=\operatorname{Sim}\left(\mathbf{x}_{v}^{k}, \mathbf{x}_{v}^{0}\right) \\ \beta=\operatorname{Sim}\left(\mathbf{x}_{v}^{k}, \mathbf{x}_{v}^{\infty}\right) \\ N S L_{v}(k)=\alpha *(1-\beta) \end{array}$ </li> <li> The number of transformations is $D_t$ and the number of propagations is $D_p$. </li> </ul> <h5 id="misconception">Misconception</h5> <h6 id="oversmoothness-is-not-the-main-contributor">Oversmoothness is not the main contributor</h6> The experiment doubles the number of propagation or transformation steps and measures the effect on performance. When the number of aggregation steps is doubled, performance barely changes, e.g. with 8 layers and 16 propagations. So oversmoothing is not the main cause. Also, with less smoothness, performance does not change much. But with more parameters, performance does drop. So do more parameters cause overfitting? <h6 id="overfiting-is-not-the-main-concept">Overfitting is not the main cause</h6> For GCN, both training and test accuracy drop, which is not overfitting (overfitting would reach 100% training accuracy). It is underfitting. <h6 id="entangle-and-disentangle">Entangle and disentangle</h6> An entangled model with residual connections shows a much slower performance drop, while the performance of a disentangled model drops more quickly. <h5 id="the-key-cause">The key cause</h5> An MLP without residual connections shows the same degradation when more layers are stacked: performance decays without residual connections. <ul> <li>Why do we need deep EP (embedding propagation)? For sparse graphs (and graphs with a large diameter). How? Combine features from different propagation steps.</li> <li>Why do we need deep ET (embedding transformation)? 
For large graphs with more information. How? Residual and jumping connections.</li> </ul> <h3 id="others-in-oversmoothness">Others in oversmoothness</h3> <h4 id="madgap-measuring-and-relieving-the-over-smoothing-problem-for-graph-neural-networks-from-the-topological-view">MADGap: Measuring and Relieving the Over-smoothing Problem for Graph Neural Networks from the Topological View</h4> This paper provides two quantitative measurements for analysis: MAD for smoothness (the similarity among node representations) and MADGap for oversmoothness, which measures the information-to-noise ratio (intra-class versus inter-class information). Based on these findings, the paper proposes MADGap regularization and AdaEdge, which removes inter-class edges. The smoothness is measured by: $D_{i j}=1-\frac{\boldsymbol{H}_{i,:} \cdot \boldsymbol{H}_{j,:}}{\left|\boldsymbol{H}_{i,:}\right| \cdot\left|\boldsymbol{H}_{j,:}\right|} \quad i, j \in[1,2, \cdots, n]$ The observation is that in the lower layers, the information-to-noise ratio is larger within local neighborhoods. The MAD value of high-layer GNNs gets close to 0. MADGap is defined by $\text{MADGap}=MAD^{rmt}-MAD^{neb}$, where $MAD^{rmt}$ is the MAD value of remote nodes in the graph topology and $MAD^{neb}$ is that of neighboring nodes. The regularization term is defined using MADGap. <h3 id="reading-list">Reading list</h3> <ul> <li> Evaluating Deep Graph Neural Networks </li> <li> On Provable Benefits of Depth in Training Graph Convolutional Networks (to write) </li> <li> Adaptive Universal Generalized PageRank Graph Neural Network </li> <li> Lipschitz Normalization for Self-Attention Layers with Application to Graph Neural Networks (to write) </li> <li> Simple Spectral Graph Convolution </li> <li> AdaGCN: Adaboosting Graph Convolutional Networks into Deep Models </li> <li> Directional Graph Networks (graph classification, not so related) </li> <li> Graph Neural Networks Inspired by Classical Iterative Algorithms (to write) </li> <li> Two Sides of the Same Coin: Heterophily and Oversmoothing in Graph Convolutional Neural Networks (to write) </li> <li> Bag of Tricks for 
Training Deeper Graph Neural Networks: A Comprehensive Benchmark Study </li> <li> Training Graph Neural Networks with 1000 Layers (to read, not so related) </li> <li> Optimization of Graph Neural Networks: Implicit Acceleration by Skip Connections and More Depth (to read) </li> <li> GRAND: Graph Neural Diffusion (to read) </li> <li> On the Bottleneck of Graph Neural Networks and Its Practical Implications </li> <li> Revisiting “Over-smoothing” in Deep GCNs </li> <li> Simple and Deep Graph Convolutional Networks </li> <li> DropEdge: Towards Deep Graph Convolutional Networks on Node Classification </li> <li> PairNorm: Tackling Oversmoothing in GNNs </li> <li> Measuring and Relieving the Over-smoothing Problem for Graph Neural Networks from the Topological View </li> <li> Continuous Graph Neural Networks (to read) </li> <li> Towards Deeper Graph Neural Networks </li> <li> Graph Neural Networks Exponentially Lose Expressive Power for Node Classification </li> <li> Measuring and Improving the Use of Graph Information in Graph Neural Networks </li> <li> Optimization and Generalization Analysis of Transduction through Gradient Boosting and Application to Multi-scale Graph Neural Networks (to read) </li> <li> Graph Random Neural Networks for Semi-Supervised Learning on Graphs </li> <li> Scattering GCN: Overcoming Oversmoothness in Graph Convolutional Networks </li> <li> Towards Deeper Graph Neural Networks with Differentiable Group Normalization </li> <li> Bayesian Graph Neural Networks with Adaptive Connection Sampling (to read) </li> <li> Predict then Propagate: Graph Neural Networks meet Personalized PageRank </li> <li> Representation Learning on Graphs with Jumping Knowledge Networks </li> <li> DeepGCNs: Can GCNs Go as Deep as CNNs? 
(image, not so related) </li> <li> Revisiting Graph Neural Networks: All We Have is Low-Pass Filters </li> </ul> <h3 id="reading">Reading</h3> Intuitively, a desirable representation of node features does not necessarily need many nonlinear transformations $f$ applied to it. This is simply because the feature of each node is normally a one-dimensional sparse vector rather than a multi-dimensional data structure such as an image, which intuitively needs a deep convolutional network to extract high-level representations for vision tasks. This insight has been empirically demonstrated in many recent works, which show that a two-layer fully-connected neural network is a better choice in implementation. </article> <article> <h1>A review on graph classification</h1> 2022-02-08T00:00:00+00:00 <h1 id="a-review-on-graph-classification">A review on Graph Classification</h1> Graph classification is a traditional task in the graph domain. However, it is hard to tell which method is the state of the art in this domain. Meanwhile, reviewing the graph classification task is also a wonderful journey to see how GNNs grew out of CNNs through many meaningful attempts. In this blog, we are glad to give an introduction to this familiar yet unfamiliar topic, graph classification, after a literature review of more than 30 papers. In this review, we aim to answer the following questions: <ul> <li>How does graph classification borrow ideas from image classification?</li> <li>How to add a pooling operation for a good graph representation? (A clustering perspective borrowed from CV)</li> <li>How to find a good aggregation function for a good graph representation? 
(A graph isomorphism perspective from the WL test)</li> <li>Occam’s Razor: rethinking the necessity of advanced operations</li> </ul> Additionally, we will talk about: <ul> <li>Interesting traditional graph kernel methods</li> <li>The OOD (out-of-distribution) problem in graph classification (Yiqi will be invited for this part)</li> </ul> <h2 id="how-does-graph-classification-borrows-ideas-from-image-classification">How does graph classification borrow ideas from Image Classification?</h2> It is not hard to see the relationship between graph classification and image classification. An image can be viewed as a special graph in which each node is a pixel with three RGB features, and the edge structure is a grid. It seems natural to generalize ideas from images to graphs, since most graphs share the same properties as images: locality, stationarity, and compositionality. CNNs have the following advantages: <ul> <li>Compared with spectral-domain methods, which rely on the fixed spectrum of the graph Laplacian of a single structure, CNNs can handle graphs of varying size and connectivity.</li> <li>Compared with differentiable Neural Graph Fingerprints (another family of methods, inspired by molecular fingerprints), which directly sum up localized vertex features, the weights in a CNN give it the power to filter out unimportant features.</li> <li>Compared with traditional graph kernel methods, the time complexity comes down from quadratic to linear in the number of nodes.</li> </ul> However, the key challenge in applying CNNs to graph data is the lack of a fixed order. Images and sentences have an inherent order, which is of great significance. For example, an image with and without its pixel order are very different things! So the problem to solve is how to find an order in the graph structure. More specifically: how to find the local receptive field, and how to order the nodes within it. Here is an example that gives a closer look at the connection. 
An image can be represented as a square grid graph whose nodes represent pixels. A CNN can be seen as traversing a node sequence (the red node) and generating fixed-size neighborhood graphs with a certain order. What needs to be solved is how to traverse the graph and generate the neighborhoods with an order. To solve this problem, we introduce three methods with ingenious designs: <ul> <li>PATCHY-SAN: node labeling for extracting locally connected regions from the graph, and neighborhood normalization for feature compression</li> <li>ECC:</li> <li>DGCNN: global graph topology for node sorting</li> </ul> We will detail these algorithms in the following part. <h3 id="patchy-san-pscn-learning-convolutional-neural-networks-for-graphs">PATCHY-SAN (PSCN): Learning Convolutional Neural Networks for Graphs</h3> The key idea in PATCHY-SAN is how to extract locally connected regions from graphs, which serve as the receptive fields of a convolutional architecture. Around this goal, two steps are constructed: <ul> <li>Determine the node sequence, and for each node in it select $k$ nodes into a neighborhood graph with a fixed order.</li> <li>Normalize the graph: turn a graph representation into a vector representation.</li> </ul> The corresponding challenges are: <ul> <li>How to decide which nodes are neighbors, analogous to physical position in an image</li> <li>How to build a unique mapping such that nodes with similar structural roles are positioned similarly in the vector representation</li> </ul> Node labeling aims to learn a function $\ell: V \to S$ from vertices to an ordered set. The WL test is one typical labeling technique; when a labeling is injective, it yields a unique adjacency matrix. Each node is mapped to a label, and nodes with the same label are in the same set. Neighborhood determination: first select the top $w$ nodes as candidates; then, for each center node (the red node in the CNN analogy), select its neighbors (determined by the node labeling) from a sequence generated from that center. 
At this stage there is no order yet. Graph normalization: normalize the assembled neighborhood. The target is that the graph distance in the original space is reconstructed as well as possible in the feature space. \[\hat{\ell}=\underset{\ell}{\arg \min } \mathbb{E}_{\mathcal{G}}\left[\left|\mathbf{d}_{\mathbf{A}}\left(\mathbf{A}^{\ell}(G), \mathbf{A}^{\ell}\left(G^{\prime}\right)\right)-\mathbf{d}_{\mathbf{G}}\left(G, G^{\prime}\right)\right|\right]\] where $\mathbf{A}^{\ell}$ is the labeling procedure. So the label is just the inverse of the rank: $d(u,v) < d(w,v) \to r(u) < r(w) \to \ell(u) > \ell(w)$. NAUTY is used as the labeling method, which accepts prior node partitions as input and breaks remaining ties by choosing the lexicographically maximal adjacency matrix. In a nutshell: label the nodes, then select and order nodes according to the labels. <h3 id="dgcnn-an-end-to-end-deep-learning-architecture-for-graph-classification">DGCNN: An End-to-End Deep Learning Architecture for Graph Classification</h3> Yet, the above method still has several drawbacks: <ul> <li>It does not use node features to understand the graph.</li> <li>It still uses a traditional node labeling technique for sorting, which is not only time-consuming but also lacks a global view of the topology.</li> </ul> Then our question is: how can we take advantage of the power of deep learning to find the receptive field (non-local, with global topology) and determine the order? The key idea is a deep learning version of the WL test. The steps are as follows. To extract node feature and structure information, a diffusion-based graph convolution layer is used: $Z = f(\tilde{D}^{-1}\tilde{A}XW)$ where $\tilde{A} = A + I$, <code class="language-plaintext highlighter-rouge">$\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$</code>, and $f()$ is the tanh activation function. Notice that, viewing $Y = XW$, the procedure is similar to the 1-WL test. Then $Z$ can be viewed as a WL signature vector, and the non-linear function can be viewed as the mapping to a new color. 
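The graph convolution layer above can be sketched in a few lines of NumPy. This is a minimal illustration of $Z = f(\tilde{D}^{-1}\tilde{A}XW)$ with dense matrices only; the function name <code class="language-plaintext highlighter-rouge">dgcnn_conv</code> and the toy graph, features, and weights are assumptions made for the example, not part of the original DGCNN code.

```python
import numpy as np

def dgcnn_conv(A, X, W):
    """One layer Z = tanh(D~^{-1} A~ X W), with A~ = A + I."""
    A_tilde = A + np.eye(A.shape[0])              # add self-loops
    deg = A_tilde.sum(axis=1, keepdims=True)      # D~_ii = sum_j A~_ij
    Y = X @ W                                     # linear feature map, Y = XW
    return np.tanh((A_tilde @ Y) / deg)           # average over neighbors, then tanh

# Toy example: a path graph 0 - 1 - 2 with one-hot node features.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.eye(3)
W = np.full((3, 2), 0.5)                          # hypothetical constant weights
Z = dgcnn_conv(A, X, W)
```

Because every row of $XW$ is identical in this toy setup, the neighborhood average leaves it unchanged and every entry of <code class="language-plaintext highlighter-rouge">Z</code> equals $\tanh(0.5)$. With real features, each row of <code class="language-plaintext highlighter-rouge">Z</code> plays the role of a continuous WL color for the corresponding node.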
To find the order of nodes, SortPooling is proposed. Notice that the above GNN can be viewed as a WL-test procedure. It is also a labeling procedure! The continuous WL color $Z^t$ is used for sorting. Sort principle: sort by the last channel of $Z^h$ in descending order; if two nodes tie on that channel, compare the next channel (the second to last), and so on. Then, unimportant nodes are dropped directly, and only $k$ nodes remain. After sorting, we have the node representations in sorted order, and a 1-D convolution and an MLP are applied on top. The key difference between DGCNN and PATCHY-SAN is that PATCHY-SAN only uses traditional graph labeling, while DGCNN uses multi-dimensional GNN features. Additionally, SortPooling can drop nodes that are less informative. Overall, the above methods focus on how to sort nodes and find an order in the graph so that a CNN can be applied on top of it. However, the order in a graph is difficult to find correctly, so it may not be a good idea to reuse CNN components directly in the graph domain. This raises an interesting question: how can the CNN components used in CV be turned into graph versions? Convolution maps naturally onto GNN layers. What, then, corresponds to the popular pooling component? <h2 id="how-to-add-pooling-operation-for-a-good-graph-representation">How to add pooling operation for a good graph representation?</h2> As mentioned above, it seems natural to design pooling specifically for graphs, because: <ul> <li>General GNNs are inherently flat and do not learn representations for groups of nodes.</li> <li>The common pooling (readout function) always takes a global view, i.e. the sum or mean of all node features, which loses much structural information. The complex topological structure of graphs precludes any straightforward</li> </ul> Can we have local pooling as in CV, following a structure like CNN-pooling-CNN-pooling-(flatten)-MLP classifier? To design local graph pooling, the following challenges should be solved: <ul> <li>What is the local patch that should be pooled? 
Clustering or downsampling.</li> <li>What will the graph look like after pooling, now that the number of nodes is smaller than before? What are the new node features and the new graph structure? Selection or aggregation.</li> </ul> To solve these challenges, various methods have been proposed, which can be roughly categorized into: <ul> <li>parameter-free clustering pooling (challenge 1), including CliquePooling and Graclus pooling</li> <li>model-based pooling (challenge 1) <ul> <li>Selection pooling (challenge 2) <ul> <li>TopKPooling, SAGPooling, HGP</li> </ul> </li> <li>Cluster pooling (challenge 2) <ul> <li>spatial perspective, including DiffPool</li> <li>spectral perspective, including EigenPooling and MinCutPooling</li> </ul> </li> </ul> </li> </ul> Parameter-free clustering pooling uses traditional machine learning methods to pre-compute the cluster matrix for pooling based on graph-theoretical properties. However, it neither considers the node features nor adapts to the specific task. We will not discuss it in detail here. Here we focus on end-to-end, model-based pooling and on the second question: what will the graph look like after pooling, now that the number of nodes is smaller than before? <h3 id="cluster-pooling">Cluster pooling</h3> Cluster pooling clusters similar nodes into one supernode by exploiting their hierarchical structure. It shares the following steps: <ul> <li>Predefine the cluster number $n_{l+1}$ (the number of nodes after pooling).</li> <li>Learn an assignment matrix $S \in \mathbb{R}^{n_l\times n_{l+1}}$ that softly divides the original graph into subgraphs. However, in this case the assignment matrix is a computationally heavy dense matrix.</li> <li>Coarsen the graph: merge the features of nodes in the same cluster into one supernode, ${X}^{(l+1)} = {S^{(l)}}^{T} Z^{(l)}$</li> <li>Update the adjacency matrix: $A^{(l+1)} = {S^{(l)}}^{T} A^{(l)}S^{(l)}$</li> </ul> The key challenge in cluster pooling is how to learn an informative cluster matrix. 
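The two coarsening steps above can be sketched as follows. This is a minimal NumPy illustration that assumes the assignment matrix $S$ is already given; here it is a hand-written hard assignment, whereas a learned model would produce a soft, row-stochastic $S$. The function name <code class="language-plaintext highlighter-rouge">coarsen</code> and the toy graph are hypothetical.

```python
import numpy as np

def coarsen(A, Z, S):
    """Cluster-pooling update: X' = S^T Z and A' = S^T A S."""
    X_next = S.T @ Z          # aggregate node features per cluster
    A_next = S.T @ A @ S      # aggregate edge weights between clusters
    return X_next, A_next

# Toy example: path graph 0 - 1 - 2 - 3 pooled into 2 clusters {0, 1} and {2, 3}.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Z = np.ones((4, 3))                       # hidden node representations
S = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]], dtype=float)       # each row sums to 1
X_next, A_next = coarsen(A, Z, S)
```

In this example <code class="language-plaintext highlighter-rouge">A_next</code> is <code class="language-plaintext highlighter-rouge">[[2, 1], [1, 2]]</code>: the diagonal counts the edges inside each cluster (once per direction), and the off-diagonal entries count the single edge between the two clusters. The diagonal entries are the self-connections of the supernodes, which some methods later remove.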
To learn an informative cluster matrix, approaches from the spatial and the spectral (spectral clustering) perspectives have been proposed. We will first introduce the spatial ones. <h4 id="diffpool-hierarchical-graph-representation-learning-with-differentiable-pooling">Diffpool: Hierarchical Graph Representation Learning with Differentiable Pooling</h4> DiffPool is the first paper on hierarchical graph pooling and defines the above learning scheme. Learning $S$ is simple and runs in parallel with the GNN model. The hidden representation is learnt by $Z^{(l)}=\operatorname{GNN}_{l, \text { embed }}\left(A^{(l)}, X^{(l)}\right)$ and the assignment matrix $S$ is learnt by $S^{(l)}=\operatorname{softmax} \left (\operatorname{GNN}_{l, \text { pool }}\left(A^{(l)}, X^{(l)}\right) \right)$ whose output dimension is the predefined number of clusters. Notice that $S$ is not computed from the hidden representation $Z$; it comes from a separate GNN that runs in parallel with the embedding GNN and has its own learnable parameters. However, $S^{(l)}$ is a large dense matrix whose size is quadratic in the number of nodes. The selection pooling mentioned earlier addresses this problem. <h4 id="structpool-structured-graph-pooling-via-conditional-random-fields">StructPool: Structured Graph Pooling via Conditional Random Fields</h4> This part is hard to write because it needs a lot of prior knowledge on CRFs; I will write about it later. Next, we introduce the spectral perspective, inspired by spectral pooling. <h4 id="eigenpooling-graph-convolutional-networks-with-eigenpooling">EigenPooling: Graph Convolutional Networks with EigenPooling</h4> In DiffPool, $S$ is a global soft assignment matrix that carries global structural information. However, pooling should also take local structural information (subgraphs) into consideration. 
This local information is hard to extract without the help of local spectral graph signals: <ul> <li>the subgraphs may contain different numbers of nodes, thus a fixed size pooling operator cannot work for all subgraphs</li> <li>the subgraphs could have very different structures, which may require different approaches to summarize the information for the supernode representation.</li> </ul> To extract this information, the graph Fourier transform based on the Laplacian matrix is introduced, which gives an understanding in the spectral domain. EigenPooling focuses on how to preserve the graph signal during pooling. The assignment matrix is obtained by spectral clustering (which we will introduce in the fundamental machine learning blog). Unlike with soft assignment, the graph nodes are divided into disjoint subgraphs. Each subgraph (cluster) $G^k$ has an indicator matrix $C^k\in \mathbb{R}^{N\times N_k}$, where $N_k$ is the number of nodes in this cluster. $C^k[i,j]=1$ means node $i$ is the $j$-th node of cluster $k$. <strong>Adjacency matrix update:</strong> the adjacency matrix is updated in the following four steps, so that only inter-cluster connections remain: <ul> <li> Compute the intra-cluster adjacency matrix of each subgraph, keeping only the connections within subgraph $G^k$: $A^k = (C^k)^TAC^k$ </li> <li> Combine the intra-cluster adjacency matrices of all subgraphs into one whole adjacency matrix: $A_{int} = \sum_{k=1}^KC^kA^k(C^k)^T$ </li> <li> The inter-cluster adjacency matrix is then the complement: $A_{ext} = A - A_{int}$ </li> <li> Finally, generate the coarsened graph: $A_{coar} = S^TA_{ext}S$ </li> </ul> <strong>Feature update:</strong> instead of updating the features directly, EigenPooling first updates the graph signal, and the features are then updated according to the new signal. 
It has the following steps: <ul> <li>Extract the subgraph signals (eigenvectors) in the spectral domain by matrix decomposition: $u_1^k, \cdots, u_{N_k}^k$</li> <li>Upsample each subgraph signal into a whole-graph signal after clustering: $\bar{u}_l^k = C^ku_l^k$.</li> <li>Therefore, for each eigenvector index $l$, collecting the signals of all $K$ subgraphs gives $\Theta_l = [\bar{u}_l^1, \cdots, \bar{u}_l^K ] \in \mathbb{R}^{N\times K}$</li> <li>Transfer the features according to the signal: $X_l = \Theta^T_lX$. There is no need to use all the graph signals here; the important low-frequency ones are enough.</li> </ul> <h4 id="mincutpooling-spectral-clustering-with-graph-neural-networks-for-graph-pooling">MinCutPooling: Spectral Clustering with Graph Neural Networks for Graph Pooling</h4> Though EigenPooling is good at preserving the graph signal, it is computationally expensive on large graphs, for two reasons: <ul> <li>It uses spectral clustering, which needs an eigenvalue decomposition of the Laplacian matrix.</li> <li>It uses eigenvalue decomposition to extract the spectral signal of each subgraph.</li> </ul> To avoid the expensive eigendecomposition in spectral clustering, MinCutPooling designs a continuous relaxation of the normalized minCUT problem, which is trained end to end with the GNN. 
The contributions are: <ul> <li>It formulates a continuous relaxation of the normalized minCUT problem that can be optimized by a GNN.</li> <li>It learns a solution that takes node features into consideration.</li> </ul> <strong>The assignment matrix</strong> is similar to DiffPool, with a parallel GNN learnt for the assignment matrix: $\begin{aligned} \overline{\mathbf{X}} &=\operatorname{GNN}\left(\mathbf{X}, \tilde{\mathbf{A}} ; \boldsymbol{\Theta}_{\mathrm{GNN}}\right) \\ \mathbf{S} &=\operatorname{SOFTMAX}\left(\operatorname{MLP}\left(\overline{\mathbf{X}} ; \boldsymbol{\Theta}_{\mathrm{MLP}}\right)\right) \end{aligned}$ <strong>The optimization objective for minCUT</strong> is: $\mathcal{L}_{u}=\mathcal{L}_{c}+\mathcal{L}_{o}=\underbrace{-\frac{\operatorname{Tr}\left(\mathbf{S}^{T} \tilde{\mathbf{A}} \mathbf{S}\right)}{\operatorname{Tr}\left(\mathbf{S}^{T} \tilde{\mathbf{D}} \mathbf{S}\right)}}_{\mathcal{L}_{c}}+\underbrace{\left\|\frac{\mathbf{S}^{T} \mathbf{S}}{\left\|\mathbf{S}^{T} \mathbf{S}\right\|_{F}}-\frac{\mathbf{I}_{K}}{\sqrt{K}}\right\|_{F}}_{\mathcal{L}_{o}}$ where $|| \cdot ||_F$ is the Frobenius norm and $\tilde{\mathbf{D}}$ is the degree matrix of the normalized adjacency matrix $\tilde{\mathbf{A}}$. $\mathcal{L}_{c}$ is the cut loss; it reaches its optimum when the clusters coincide with the $K$ components of the graph. However, it may fall into a trivial solution that assigns all nodes to one cluster. The term $\mathcal{L}_{o}$ therefore acts as a regularizer that balances the number of nodes across clusters. The feature and adjacency matrix updates are the same as in the main framework. Additionally, the pooled adjacency matrix would otherwise be dominated by self-connections, so the self-loops are removed and the result is degree-normalized as follows. 
$\hat{\mathbf{A}}=\mathbf{A}^{\text {pool }}-\mathbf{I}_{K} \operatorname{diag}\left(\mathbf{A}^{\text {pool }}\right) ; \quad \tilde{\mathbf{A}}^{\text {pool }}=\hat{\mathbf{D}}^{-\frac{1}{2}} \hat{\mathbf{A}} \hat{\mathbf{D}}^{-\frac{1}{2}}$ <h3 id="selection-pooling">Selection pooling</h3> Selection pooling obtains a score for each node using information from the graph convolutional layers, and then drops unnecessary nodes with lower scores at each pooling step. It shares the following steps: <ul> <li>Give each node a score, forming the score list $y\in \mathbb{R}^{n \times 1}$</li> <li>Select the top $k$ nodes: $idx = rank(y,k)$</li> <li>The new node features are the features of the selected nodes: $X^{l+1} = X^l(idx, :)$</li> <li>The new adjacency matrix contains the edges between the selected nodes: $A^{l+1} = A^l(idx, idx)$ (if the adjacency matrix is too sparse, you can use $A^2$ to include the two-hop neighborhood)</li> </ul> Pooling here simply selects the important nodes to form a smaller graph, which makes it easy to locate each node's position in the original graph and is not computationally expensive. The assumption is that the selected nodes already contain the necessary information. The challenges in selection pooling are: <ul> <li>Steps 1 and 2 generate a discrete index list, which is not trainable.</li> <li>Selection is so simple that it may lose essential feature and structure information.</li> </ul> We introduce the following papers, which address these challenges. <h4 id="topkpooling-graph-u-nets">TopKPooling: Graph U-Nets</h4> This paper is not exactly about pooling; it borrows a famous architecture from CV, U-Net. In order to follow U-Net, it is necessary to design graph pooling and graph unpooling operations. Graph pooling follows the selection procedure. To give each node a score, the node features are projected onto a learnable vector: $y = X^l\mathbf{p}^l/||\mathbf{p}^l||$ To solve the trainability challenge, $y$ also serves as a gate for the features. 
$\tilde{y} = sigmoid(y(idx))$ \[\tilde{X}^{l+1} = X^{l+1} \odot (\tilde{y}1_C^T)\] However, I think using only a linear projection for the score is too simple, and it will lead to selecting many nodes with similar features. Graph unpooling exploits the advantage of selection pooling: it is easy to track where each node originally was. Unpooling simply puts the learned node features back at their positions in the original graph and sets the other node features to 0: $X^{l+1} = distribute(0_{N\times C}, X^l,idx)$ The zero node features are then filled in by message passing in the following GCN layers. The model architecture is as follows. Notice that there is a residual link between each pooling layer and its corresponding unpooling layer. <h4 id="sagpooling-self-attention-graph-pooling">SAGPooling: Self-Attention Graph Pooling</h4> SAGPooling is quite similar to TopKPooling. The differences are (1) the model architecture, which is similar to DiffPool, and (2) the scoring function. For the scoring function, the difference is that it applies the projection matrix before the activation. The SAGPooling scoring layer can be written as $Y=\sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X \Theta_{a t t}\right)$ where $\Theta_{a t t}\in \mathbb{R}^{F\times 1}$ and the activation function is tanh. TopKPooling can be rewritten as: $Y=\sigma(\sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X W\right) \Theta_{a t t} / ||\Theta_{a t t}||_2)$ where the activation function is sigmoid. The other operations are the same as in TopKPooling. Maybe this scoring is better than that of TopKPooling, or maybe the architecture is better. <h4 id="hgp-hierarchical-graph-pooling-with-structure-learning">HGP: Hierarchical Graph Pooling with Structure Learning</h4> This paper focuses on the second challenge: selection is so simple that it may lose essential feature and structure information. For features, HGP adaptively selects nodes using both node features and graph topology. The scoring function is based on whether a node's information is already contained in its neighborhood. 
If a node can be reconstructed from its neighbors, it can be deleted with no information loss. $Y = ||(I_i^k-(D_i^k)^{-1}A_i^k)H_i^k||_1$ The more information remains after removing what the neighbors provide, the more important it is to preserve the node. For structure, HGP refines the graph structure to preserve the key substructures. It starts from the selected features and the selected adjacency matrix, $H^{l+1} = H^l(idx, :)$ and $A^{l+1} = A^l(idx, idx)$. The structure refinement is a feature attention over node pairs, similar to GAT: $\mathbf{E}_{i}^{k}(p, q)=\sigma\left(\overrightarrow{\mathbf{a}}\left[\mathbf{H}_{i}^{k}(p,:) \| \mathbf{H}_{i}^{k}(q,:)\right]^{\top}\right)+\lambda \cdot \mathbf{A}_{i}^{k}(p, q)$ where $\overrightarrow{\mathbf{a}}\in \mathbb{R}^{2d\times 1}$. The result is then sent to a sparse softmax function to learn a sparse graph structure.
<h3 id="go-beyond-pooling-gmn-with-clustering-memory-based-graph-networks">Go beyond pooling: GMN with clustering (MEMORY-BASED GRAPH NETWORKS)</h3>
Similar to graph pooling methods, GMN also builds on graph clustering. However, it jointly learns node representations and coarsens the graph without updating the graph structure; the coarsened graph is treated as fully connected, which may lose much information. It uses the updated queries to find the new structure and the memory to maintain the refined graph information. The main procedure is:
<ul> <li> Generate the original query $Q^0$ with an MPNN (GAT) or a simple MLP. </li> <li> Generate the key matrix $K \in \mathbb{R}^{n_{l+1} \times d_l}$, whose rows are simply the cluster centers. The cluster assignment for a particular query is computed with the Student's t-distribution.
$C_{i, j}=\frac{\left(1+\left\|q_{i}-k_{j}\right\|^{2} / \tau\right)^{-\frac{\tau+1}{2}}}{\sum_{j^{\prime}}\left(1+\left\|q_{i}-k_{j^{\prime}}\right\|^{2} / \tau\right)^{-\frac{\tau+1}{2}}}$ The assignments $\mathbf{C}_{k}^{(l)}$ are then concatenated and aggregated as $\mathbf{C}^{(l)}=\operatorname{softmax}\left(\Gamma_{\phi}\left(\|_{k=0}^{|h|} \mathbf{C}_{k}^{(l)}\right)\right) \in \mathbb{R}^{n_{l} \times n_{l+1}}$, where $\Gamma_{\phi}$ is a $[1\times 1]$ convolutional operator for reduction. </li> <li> Update the values (new node features): $\mathbf{V}^{(l)}=\mathbf{C}^{(l) \top} \mathbf{Q}^{(l)} \in \mathbb{R}^{n_{l+1} \times d_{l}}$ </li> <li> Update the query set (cluster centers): $\mathbf{Q}^{(l+1)}=\sigma\left(\mathbf{V}^{(l)} \mathbf{W}\right) \in \mathbb{R}^{n_{l+1} \times d_{l+1}}$ </li> </ul>
The graph structure is only used in the encoder; the rest is just iterative clustering of node features.
<h2 id="how-to-find-good-readout-function-for-a-good-graph-representation">How to find a good readout function for a good graph representation?</h2>
In the last part, we mainly focused on how pooling can coarsen the graph. However, pooling may also introduce much noise and may not be so effective. Another family of methods focuses on another key component of GNNs, the readout function, for example Set2Set. One direction is how to extract graph features effectively in a single step (without refining the node features and the adjacency matrix). Another direction is how to go beyond the expressiveness bound of the 1-WL test. With greater expressive power, it becomes possible to distinguish more graphs. We will not introduce the second direction in this blog, since it is another big problem in GNNs with heavy math. I will write another blog about it later.
<h3 id="capsgnn-capsule-graph-neural-network">CapsGNN: CAPSULE GRAPH NEURAL NETWORK</h3>
Most readout functions treat the input in a scalar manner (for example, computing the element-wise maximum), which loses many properties. It is better to encode them in a vector manner, as a capsule does.
It can generate multiple embeddings to capture properties from different perspectives; routing preserves all the information from the low-level capsules and routes it to the closest high-level capsules. CapsGNN contains three key blocks:
<ul> <li>Basic node capsule extraction block: a GNN.</li> <li>High-level graph capsules: attention and dynamic routing.</li> <li>Class capsules for graph classification.</li> </ul>
The paper is about how to apply capsule networks to graph-structured data. The attention procedure is proposed to normalize the features in different channels from different layers. The rescaling attention is: $\operatorname{scaled}\left(\boldsymbol{s}_{(n, i)}\right)=\frac{F_{a t t n}\left(\tilde{\boldsymbol{s}_{n}}\right)_{i}}{\sum_{n} F_{a t t n}\left(\tilde{\boldsymbol{s}_{n}}\right)_{i}} \boldsymbol{s}_{(n, i)}$ where $F_{attn}$ is just a two-layer MLP. To calculate votes, capsules of different nodes from the same channel share the transform matrix, which results in a set of votes $V\in \mathbb{R}^{N\times C_{all}\times P \times d}$. A dynamic routing mechanism then produces the final classification result. We will not detail the capsule network here, since it is complicated and belongs to another topic.
<h3 id="gmn-accurate-learning-of-graph-representations-with-graph-multiset-pooling">GMT: ACCURATE LEARNING OF GRAPH REPRESENTATIONS WITH GRAPH MULTISET POOLING</h3>
This paper first formulates graph pooling as a multiset encoding problem with auxiliary information about the graph structure. Instead of pooling between layers, it performs a single but strong pooling step that satisfies both injectiveness and permutation invariance. The method also shows very good results on multiple tasks. The method is self-attention style with a multi-dimensional query set: $att(Q,K,V)=f(QK^T)V$. By default, $Q\in \mathbb{R}^{n_1\times d_k}$ is not generated from the features but is a learnable parameter matrix.
To bring GNNs into this framework, Graph Multi-head Attention is defined as: $\operatorname{GMH}(\boldsymbol{Q}, \boldsymbol{H}, \boldsymbol{A})=\left[O_{1}, \ldots, O_{h}\right] \boldsymbol{W}^{O} ; \quad O_{i}=\operatorname{Att}\left(\boldsymbol{Q} \boldsymbol{W}_{i}^{Q}, \operatorname{GNN}_{i}^{K}(\boldsymbol{H}, \boldsymbol{A}), \operatorname{GNN}_{i}^{V}(\boldsymbol{H}, \boldsymbol{A})\right)$ Here $Q$ is a learnable parameter matrix, and $K$ and $V$ are generated by two-layer GNNs. The key component, Graph Multiset Pooling with Graph Multi-head Attention, is then defined as: $\operatorname{GMPool}_{k}(\boldsymbol{H}, \boldsymbol{A})=\mathrm{LN}(\boldsymbol{Z}+\operatorname{rFF}(\boldsymbol{Z})) ; \quad \boldsymbol{Z}=\mathrm{LN}(\boldsymbol{S}+\operatorname{GMH}(\boldsymbol{S}, \boldsymbol{H}, \boldsymbol{A}))$ where $\boldsymbol{S}$ is the learnable query matrix and $k$ is the number of query vectors. Removing the query gives a simple self-attention for inter-node relations: $\operatorname{SelfAtt}(\boldsymbol{H})=\mathrm{LN}(\boldsymbol{Z}+\operatorname{rFF}(\boldsymbol{Z})) ; \quad \boldsymbol{Z}=\mathrm{LN}(\boldsymbol{H}+\mathrm{MH}(\boldsymbol{H}, \boldsymbol{H}, \boldsymbol{H}))$ The whole structure is: $\text { Pooling }(\boldsymbol{H}, \boldsymbol{A})=\operatorname{GMPool}_{1}\left(\operatorname{SelfAtt}\left(\operatorname{GMPool}_{k}(\boldsymbol{H}, \boldsymbol{A})\right), \boldsymbol{A}^{\prime}\right)$ Here $A'$ is a coarsened adjacency matrix, so in some sense the method is still hierarchical. (Theorem part to be added.) This paper somewhat astonished me with the ability of attention. However, there are some query hyperparameters to tune. I am wondering: if this were applied to selection pooling, would the performance also be better? And what is the power of a learnable query compared with self-attention?
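The learnable-query attention $att(Q,K,V)=f(QK^T)V$ can be illustrated with a minimal NumPy sketch. It rests on several assumptions that are not taken from the paper: $f$ is a row-wise softmax with the usual $1/\sqrt{d}$ scaling, there is a single head, and the node features $H$ are used directly as keys and values instead of GNN outputs. The names `attention_pool`, `H` and `Q` are hypothetical. It only shows that $k$ query rows compress $n$ nodes into $k$ vectors, and that the result does not depend on the node order.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(Q, K, V):
    # att(Q, K, V) = f(Q K^T) V, with f assumed to be a scaled row-wise softmax.
    d_k = Q.shape[1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=1)  # shape (k, n)
    return weights @ V                                 # shape (k, d)

rng = np.random.default_rng(0)
n, d, k = 6, 4, 2
H = rng.normal(size=(n, d))   # stand-in for node representations
Q = rng.normal(size=(k, d))   # stand-in for the learnable query matrix

pooled = attention_pool(Q, H, H)

# Reordering the nodes permutes keys and values together, so the pooled
# output is unchanged (permutation invariance).
perm = rng.permutation(n)
pooled_perm = attention_pool(Q, H[perm], H[perm])
```

With $k=1$ the same function returns a single graph-level vector, which is the role of the final pooling step in the structure above.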
<h2 id="occams-razor-rethinking-on-the-neccerarities-on-advance-operations">Occam’s Razor: Rethinking the necessity of advanced operations</h2>
After many years of development in graph classification, many advanced operations have been proposed. However, how well they really work is still unknown. We have reason to doubt the effectiveness of these operations, since in most cases the only node feature is a one-hot node label or the node degree, so not much information is available. It is important to understand which components of the increasingly complex methods are necessary or most effective.
<h3 id="a-simple-yet-effective-baseline-for-non-attribute-graph-classification">A simple yet effective baseline for non-attribute graph classification</h3>
This paper proposes a really simple baseline with feature augmentation by neighborhood degree. The degree of the neighborhood is defined as $DN(v) = \{ degree(u) \mid (u,v) \in E \}$. The procedure is as follows:
<ul> <li>Five feature types are constructed: (degree($v$), min(DN($v$)), max(DN($v$)), mean(DN($v$)), std(DN($v$))).</li> <li>Perform either a histogram or an empirical distribution function (EDF) operation, i.e., map all node features into a histogram or an empirical distribution.</li> <li>An SVM is used for classification.</li> </ul>
Since $DN(v)$ is one-hop neighbor information, min and max can be viewed as different readout functions in a GNN. These features can therefore be viewed as a simplified version of a GNN. The remaining questions are:
<ul> <li>Does capturing complex graph structure really help performance?</li> <li>How should the existing features on the graph be utilized?</li> </ul>
<h3 id="are-powerful-graph-neural-nets-necessary-a-dissection-on-graph-classification">Are Powerful Graph Neural Nets Necessary? A Dissection on Graph Classification</h3>
This paper challenges two key components of GNNs: (1) the non-linear graph filter and (2) the readout function.
It aims to answer the following questions:
<ul> <li>Do we need a sophisticated graph filtering function for a particular task or dataset?</li> <li>And if we have a powerful set function, is it enough to use a simple graph filtering function?</li> </ul>
Model without a non-linear graph filter (GFN): instead of non-linear aggregation, the aggregated features are used as data augmentation: $X^G = [d, X, \tilde{A}^1X, \tilde{A}^2X, \cdots, \tilde{A}^KX]$ Then the readout function is: $\operatorname{GFN}(G, X)=\rho\left(\sum_{v \in \mathcal{V}} \phi\left(X_{v}^{G}\right)\right)$ where $\rho$ and $\phi$ are both MLPs, and $\phi$ always has more layers. In the fair-comparison paper, a similar baseline was proposed for social datasets; even without aggregated features, it can beat state-of-the-art GNN models. Additionally, first sum-aggregating all features and then applying an MLP with ReLU for classification also gives good results on biochemical datasets. GFN shows good generalization ability with less overfitting on the training data.
Model without both the non-linear graph filter and the non-linear readout function: the experiments are lacking here, and there are few experiments on a model with only the non-linear graph filter. It is no wonder that a linear GNN cannot perform well.
<h3 id="rethinking-pooling-in-graph-neural-networks">Rethinking pooling in graph neural networks</h3>
This paper is among the most surprising ones, because it checks an essential property of hierarchical pooling: is clustering a must-have property, before even measuring how good the assignment matrix is? (Somehow, this paper ignores selection pooling, and it is hard to say what "local" means in this paper.)
The surprising findings are:
<ul> <li>Even when the assignment matrix is built from the complement graph, comparable results can be achieved.</li> <li>GNNs learn smooth and homophilous node representations, which makes hierarchical pooling less important.</li> <li>A simple GraphSAGE can also achieve good results.</li> </ul>
The main experimental perspectives are as follows. For off-the-shelf graph clustering methods, replace the adjacency matrix by its complement before clustering. For pooling methods, the main focus is on the cluster assignment matrix $S$: the assignment matrix can be replaced by a normalized random matrix. If the clustering method is distance-based, assign each node to the farthest cluster instead of the closest. Surprisingly, these variants also show comparable performance on some datasets. After analysis, the main reason is that nodes display similar activation patterns before pooling, because the GNN extracts low-frequency information across the graph. Therefore, all that pooling does is build an even more similar representation. Moreover, node representations are already homogeneous even before the first pooling layer. (Does this depend on the situation?) This naturally poses a challenge for the upcoming pooling layers to learn meaningful local structures.
<h2 id="graph-kernel-methods">Graph Kernel Methods</h2>
The kernel trick is usually used in the dual SVM. With an SVM, different kernels can be designed for different specific domains. It is a specific topic within feature engineering.
<h3 id="brief-introduction-on-svm-and-kernel-method">Brief introduction on SVM and kernel method.</h3>
SVM is a risk-minimization algorithm that makes the distance between the data points and the decision boundary as large as possible. It solves
$\begin{array}{cl} \max _{b, \mathbf{w}} & \text { margin }(b, \mathbf{w}) \\ \text { subject to } & \text { every } y_{n} (\mathbf{w}^{T} \mathbf{x}_{n}+b)>0 \\ & \text { margin }(b, \mathbf{w})=\min _{n=1, \ldots, N} \operatorname{distance}\left(\mathbf{x}_{n}, b, \mathbf{w}\right) \end{array}$
After simplification, the problem can be rewritten as:
$\begin{array}{cl} \min _{b, \mathbf{w}} & \frac{1}{2}\mathbf{w}^T\mathbf{w} \\ \text { subject to } & y_{n} (\mathbf{w}^{T} \mathbf{x}_{n}+b) \ge 1, \text { for } n=1, \ldots, N \end{array}$
However, the time complexity of solving this problem depends on the feature dimension, which grows exponentially when feature engineering adds higher-order interactions. Under strong duality, the problem can be converted into: $\min_{b,w} \to \min_{b,w}\left(\max_{\alpha \ge 0}\mathcal{L}(b,w,\alpha) \right) \to \max_{\alpha \ge 0}\left(\min_{b,w}\mathcal{L}(b,w,\alpha) \right)$ Solving the inner problem gives: $w = \sum_{n=1}^N\alpha_n y_n z_n$ \[\sum_{n=1}^N\alpha_ny_n = 0\] where $z_n$ is $x_n$ after feature engineering. The problem can then be solved as: $\begin{aligned} \min _{\alpha} & \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_{n} \alpha_{m} y_{n} y_{m} \mathbf{z}_{n}^{T} \mathbf{z}_{m}-\sum_{n=1}^{N} \alpha_{n} \\ \text { subject to } & \sum_{n=1}^{N} y_{n} \alpha_{n}=0 \\ & \alpha_{n} \geq 0, \text { for } n=1,2, \ldots, N \end{aligned}$ Now the time complexity depends on the number of training samples. $\mathbf{z}_{n}^{T} \mathbf{z}_{m}$ is the design space for the kernel function! The SVM prediction function can be written as $y = \operatorname{sign}\left(\sum_{n} \alpha_{n} y_{n} \mathbf{z}_{n}^{T} \mathbf{z} + b\right)$ Therefore, each new sample is compared with the training samples through the kernel function.
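To make the role of $\mathbf{z}_{n}^{T} \mathbf{z}_{m}$ concrete, here is a small sketch of the kernel trick. The degree-2 polynomial kernel and the helper names `phi` and `poly_kernel` are chosen only for illustration; they do not come from any of the papers above. The kernel value computed in the original space equals the inner product of the explicitly engineered features, without ever building them.

```python
import numpy as np

def phi(x):
    # Explicit feature engineering: all pairwise products x_i * x_j.
    # The dimension grows from d to d^2.
    return np.outer(x, x).ravel()

def poly_kernel(x, z):
    # Kernel trick: the same inner product, computed in the original
    # d-dimensional space.
    return float(x @ z) ** 2

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

explicit = float(phi(x) @ phi(z))   # inner product of engineered features
implicit = poly_kernel(x, z)        # kernel evaluation, no feature map built
```

Both numbers agree because $\sum_{i,j} x_i x_j z_i z_j = (\mathbf{x}^T\mathbf{z})^2$, which is why the dual problem only needs kernel evaluations between pairs of samples.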
A kernel measures similarity, which corresponds to an inner product. A graph kernel is designed to measure the similarity of graph structures.
<h3 id="traditional-graph-kernel">Traditional Graph Kernel</h3>
Graph kernels learn structural latent representations from predefined sub-structures of graphs. Most graph kernel methods belong to the R-convolution framework, whose key idea is:
<ul> <li>Find the atomic subgraph patterns (recursively decompose the graph into subgraphs).</li> <li>$\phi(\mathcal{G})$ denotes the vector that contains the counts of atomic sub-structures, i.e., a normalized count vector.</li> <li>The similarity between graphs is computed as $K(\mathcal{G},\mathcal{G}') = \phi(\mathcal{G}) \cdot \phi(\mathcal{G}')$.</li> </ul>
Traditional graph kernels focus on the design of the atomic sub-structure: size-limited subgraphs (graphlets), subtree patterns (WL), walks (random walk), and paths (shortest path).
Subgraph (graphlet): a graphlet $G$ is an induced, non-isomorphic subgraph. Graphlets are generated by adding a node, adding an edge, or removing an edge. The graphlet kernel with $k \le 5$ has a count vector with 52 features.
Subtree (WL): WL iterates over each vertex of a labeled graph and its neighbors in order to create a multiset label. The resulting multiset is given a new label, which is then used in the next iteration, until the label set no longer changes. To compute the similarity, just count the co-occurrences of labels in both graphs. The dimension depends on the number of iterations.
Path (shortest path): the shortest-path pattern is defined as the triplet $(l_s^i, l_e^i, n_k)$, where $l_s^i$ and $l_e^i$ are the labels of the start and end vertices and $n_k$ is the path length. The kernel similarity just counts the co-occurrences of shortest-path patterns.
<h3 id="deep-graph-kernel">Deep Graph Kernel</h3>
The assumption so far is that different subgraph patterns are counted independently. However, it is easy to see that some subgraph patterns depend on each other. A positive semidefinite similarity matrix $\mathcal{M}$ between sub-structures should therefore be computed.
$K(\mathcal{G},\mathcal{G}') = \phi(\mathcal{G})^T \mathcal{M} \phi(\mathcal{G}')$ $\mathcal{M}$ is learned with a skip-gram model inspired by Word2Vec, and the key is to define the co-occurrence between subgraph patterns.
SP: shortest paths with the same source node are viewed as the context.
WL: the multiset labels on different nodes in the same iteration are viewed as the context.
Some weaknesses:
<ul> <li>The graphlet kernel builds kernels based on fixed-size subgraphs. These subgraphs, which are often called motifs or graphlets, reflect functional network properties. However, due to the combinatorial complexity of subgraph enumeration, graphlet kernels are restricted to subgraphs with few nodes.</li> <li>The WL kernel only supports discrete features and uses memory linear in the number of training examples at test time.</li> <li>Deep graph kernels and graph invariant kernels compare graphs based on the existence or count of small substructures such as shortest paths, graphlets, subtrees, and other graph invariants.</li> <li>All graph kernels have a training complexity at least quadratic in the number of graphs, which is prohibitive for large-scale problems.</li> </ul>
<h3 id="ddgk--learning-graph-representations-for-deep-divergence-graph-kernels">DDGK: Learning Graph Representations for Deep Divergence Graph Kernels</h3>
DDGK is a new expressive kernel based on deep learning that encodes a relaxed notion of graph isomorphism. It removes the heuristic constraints of the traditional methods. The major perspectives are three-fold:
<ul> <li>How to represent a graph (capture the graph information).</li> <li>How to align graphs and find the similarity (cross-graph alignment).</li> <li>How to measure the divergence score (the similarity kernel representation).</li> </ul>
Notice that it is more like a graph embedding method without a supervision signal. The graph encoder should be able to reconstruct the structure around given nodes.
It aims to predict the identity of a node's neighbors with a single linear layer (embedding lookup): $J(\theta)=\sum_{i} \sum_{j \atop e_{i j} \in E} \log \operatorname{Pr}\left(v_{j} \mid v_{i}, \theta\right)$
Cross-graph attention aims to measure how much the target graph diverges from the source graph; it uses the source graph encoder to predict the structure of the target graph. If the pair is similar, we expect the source graph encoder to correctly predict the target graph’s structure. A bidirectional projection across pairs of nodes is proposed. It assigns each target node $u_i$ to the source nodes $v_j$ with certain probabilities: $\operatorname{Pr}\left(v_{j} \mid u_{i}\right)=\frac{e^{\mathcal{M}_{T \rightarrow S}\left(v_{j}, u_{i}\right)}}{\sum_{v_{k} \in V_{S}} e^{\mathcal{M}_{T \rightarrow S}\left(v_{k}, u_{i}\right)}}$ The reverse projection is similar. $\mathcal{M}_{T \rightarrow S}\left(v_{j}, u_{i}\right)$ is a simple multiclass classifier.
Divergence embedding: each pair of nodes between the two graphs is then used for comparison. It measures how well the source graph encoder can predict the neighborhoods of the target nodes, and the scores are concatenated as the final embedding.
$\mathcal{D}^{\prime}(T \| S)=\sum_{v_{i} \in V_{T}} \sum_{j \atop e_{j i} \in E_{T}}-\log \operatorname{Pr}\left(v_{j} \mid v_{i}, H_{S}\right)$
<h2 id="paper-reading-list">Paper Reading List</h2>
<ul> <li>Nested Graph Neural Networks (NIPS)</li> <li>StructPool: Structured Graph Pooling via Conditional Random Fields (ICLR)</li> <li>Rethinking pooling in graph neural networks (NIPS)</li> <li>Principal Neighbourhood Aggregation for Graph Nets (NIPS)</li> <li>MEMORY-BASED GRAPH NETWORKS (ICLR)</li> <li>A FAIR COMPARISON OF GRAPH NEURAL NETWORKS FOR GRAPH CLASSIFICATION (ICLR)</li> <li>InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization (ICLR)</li> <li>Convolutional Kernel Networks for Graph-Structured Data (ICML)</li> <li>ACCURATE LEARNING OF GRAPH REPRESENTATIONS WITH GRAPH MULTISET POOLING (ICLR2020)</li> <li>Benchmarking Graph Neural Networks (arXiv 2020)</li> <li>Bridging the Gap Between Spectral and Spatial Domains in Graph Neural Networks (arXiv 2020)</li> <li>Open Graph Benchmark (NIPS2020)</li> <li>TUDataset: A collection of benchmark datasets for learning with graphs (arXiv 2020)</li> <li>Spectral Clustering with Graph Neural Networks for Graph Pooling (ICML2020)</li> <li>Graph Convolutional Networks with EigenPooling (KDD)</li> <li>HOW POWERFUL ARE GRAPH NEURAL NETWORKS? (ICLR)</li> <li>Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks (AAAI)</li> <li>Self-Attention Graph Pooling (ICML)</li> <li>Graph U-Net (ICML)</li> <li>Are Powerful Graph Neural Nets Necessary?
A Dissection on Graph Classification (arXiv)</li> <li>CAPSULE GRAPH NEURAL NETWORK (ICLR)</li> <li>CLIQUE POOLING FOR GRAPH CLASSIFICATION (arXiv)</li> <li>Hierarchical Graph Pooling with Structure Learning (AAAI19)</li> <li>DDGK: Learning Graph Representations for Deep Divergence Graph Kernels (WWW19)</li> <li>An End-to-End Deep Learning architecture for graph classification (AAAI)</li> <li>Hierarchical graph representation learning with differentiable pooling (NIPS)</li> <li>Towards Sparse Hierarchical Graph Classifiers (arXiv)</li> <li>A simple yet effective baseline for non-attribute graph classification (arXiv)</li> <li>Dynamic edge-conditioned filters in convolutional neural networks on graphs (CVPR)</li> <li>Learning Convolutional Neural Networks for Graphs</li> </ul>
<h2 id="useful-words">Useful words</h2>
Several approaches have been proposed in recent GNN literature, ranging from model-free methods that precompute the pooled graphs by leveraging graph-theoretical properties (Bruna et al., 2013; Defferrard et al., 2016), to model-based methods that perform pooling through a learnable function of the node features (Ying et al., 2018; Cangea et al., 2018; Hongyang Gao, 2019). However, existing model-free methods only consider the graph topology but ignore the node features, while model-based methods are mostly based on heuristics. As a consequence, the former cannot learn how to coarsen graphs adaptively for a specific downstream task, while the latter are unstable and prone to find degenerate solutions even on simple tasks. A pooling operation that is theoretically grounded and can adapt to the data and the task is still missing.
</article> <article> <h1>Highlight of Linear Algebra</h1> 2022-02-05T00:00:00+00:00 $Ax=b$, $Ax=\lambda x$ $Av=\sigma u$, $\min{ \frac{||Ax||^2}{||x||^2}}$
<ul> <li> Multiplication $Ax$ using the columns of $A$ <ul> <li>What are the column space, row space, independent vectors, basis, rank, and the CR decomposition</li> <li>Matrix multiplication, outer product</li> </ul> </li> <li> Factorization <ul> <li> $A=LU$ <ul> <li>Repeated elimination: $Ax=b\to LUx=b \to Lc = b, c = Ux$</li> </ul> </li> <li> $A=QR$ </li> <li> Eigenvalue decomposition <ul> <li>Symmetric, positive definite, and positive semidefinite matrices; the energy function</li> <li>Some properties of rank</li> <li>$A+sI \to \lambda_1+s, \lambda_2+s, \ldots$</li> </ul> </li> <li> Singular value decomposition $Av_1 = \sigma_1 u_1 $ <ul> <li> Derive from $AA^T$ and $A^TA$ </li> <li> How to compute </li> <li> The relation between left and right eigenvectors, more vectors with 0 </li> <li> If $A=xy^T$ has rank 1, $\sigma_1 \ge |\lambda |$ </li> <li> Reflect (affine), scale, reflect; the reduced form </li> <li> The function understanding of SVD </li> <li> Function form, polar decomposition $r\cos\theta+ir\sin\theta = re^{i\theta}$, $A = QS$: orthogonal times positive semidefinite, separating rotation from stretch </li> <li> \[A = U\Sigma V^T=(UV^T)(V\Sigma V^T) = QS\] </li> </ul> </li> </ul> </li> <li> Orthogonal matrices <ul> <li>Orthogonal basis: a and c, $c = b - \frac{a^Tb}{a^Ta}a$</li> <li>Orthogonal projection: $P = QQ^T$. Projection matrix: $P^2=P$</li> <li>Coefficient in an orthogonal basis: $c_1=q_1^Tv$</li> <li>Multiplying by an orthogonal matrix does not change the matrix norm. Since $A=U\Sigma V^T$ and $U$ and $V$ are both orthogonal, the matrix norm depends only on $\Sigma$</li> </ul> </li> <li> Norm <ul> <li>Eckart-Young: if $B$ has rank $k$, $||A-B|| \ge ||A - A_k||$</li> <li>Inner product</li> <li>Matrix norms: three norms depend only on $\sigma$; what about the others?
<ul> <li>$l_2$ norm: $\max \frac{||Ax||}{||x||} = \sigma_1$</li> <li>Frobenius: $\sqrt{\sigma_1^2+\sigma_2^2+\cdots + \sigma_r^2}$</li> <li>Nuclear norm: $\sigma_1+\sigma_2+\cdots + \sigma_r$, the minimum value of $||U||_F||V||_F$ over factorizations $UV=A$; $||A^TA||_N=||A||_F^2$</li> </ul> </li> <li>Vector norms $l_0, l_1, l_2, l_\infty$</li> <li>The intuition of the matrix norm: the minimum of $||V||_p$</li> <li>Properties of norms: rescaling, triangle inequality</li> <li>Function norm -> the vector space must be complete: $||v_n - v_\infty|| \to 0 $ <ul> <li>The space of vectors ending in all zeros is not complete (p. 105)</li> <li>$norm < \infty$</li> </ul> </li> <li>Spectral radius: related to the stationary distribution of a Markov chain</li> </ul> </li> <li> Application <ul> <li> PCA <ul> <li>The statistics behind it <ul> <li>variances are the diagonal entries of the matrix $AA^T$</li> <li>covariances are the off-diagonal entries of the matrix $AA^T$</li> </ul> </li> <li>The geometry behind it: the sum of squared distances from the data points to the $u_1$ line is a minimum.</li> <li>The linear algebra behind it: the total variance $T$ is the sum of the $\sigma_i^2$</li> <li>The quick drop of $\sigma$ in the Hilbert matrix</li> </ul> </li> <li> Rayleigh quotients, generalized eigenvalues $Sx_1 = \lambda M x_1$ (p. 99), $R(x) = \frac{x^TSx}{x^Tx}$ <ul> <li>Generalized Rayleigh quotient $R(x) = \frac{x^TSx}{x^TMx}$ <ul> <li>$M$: covariance matrix, positive definite. The maximum of $R(x)$ is the largest eigenvalue of $M^{-1}S$ (equivalently of $M^{-\frac{1}{2}}SM^{-\frac{1}{2}}$)</li> </ul> </li> <li>Generalized eigenvectors are M-orthogonal: $x_1^TMx_2 = 0$, $x = M^{-\frac{1}{2}}y$</li> <li>Semidefinite situation: $\alpha Sx = \beta Mx$, where $\alpha$ may equal 0.
The number of samples is smaller than the number of features.</li> <li>Generalized SVD: $A=U_A\Sigma_AZ$, $B=U_B\Sigma_BZ$</li> <li>Any two positive definite matrices can be decomposed with the same invertible matrix</li> </ul> </li> <li> LDA: separation ratio $R = \frac{(x^Tm_1-x^Tm_2)^2}{x^T\Sigma_1x+x^T\Sigma_2x}$, $S = (m_1-m_2)(m_1-m_2)^T$ (I have not fully understood the physical meaning behind the solution.) </li> </ul> </li> </ul>
$\lambda_1 \le \sigma_1$. AB and BA: $ABx=\lambda x$, $BABx=\lambda Bx$, so $Bx$ is an eigenvector of $BA$; the eigenvectors also correspond to each other.
<h1 id="low-rank-and-compressed-sensing">Low rank and Compressed Sensing</h1>
<ul> <li>Key insights <ul> <li>Matrices are composed of low-rank matrices: $uv^T$ is the extreme case with rank 1</li> <li>Singular values: low effective rank.</li> <li>Most matrices are completed by low-rank matrices.</li> </ul> </li> <li> How a matrix changes when a low-rank matrix is added <ul> <li> Normal perspective (use a small matrix in exchange for the larger matrix) <ul> <li> $A^{-1}$: $(A-UV^T)^{-1} = A^{-1}+A^{-1}U(I-V^TA^{-1}U)^{-1}V^TA^{-1}$ \[(I-UV^T)^{-1} = I +U(I-V^TU)^{-1}V^T\] </li> <li> Eigenvalues and singular values (interlacing). A graphical explanation: the solution is an inverse function in which the last one can never go beyond the first one. $z_1 \ge \lambda_1 \ge \cdots$ </li> </ul> </li> </ul> </li> <li> Derivative perspective <ul> <li> $A^{-1}$: $\frac{dA^{-1}}{dt} = - A^{-1}\frac{dA}{dt}A^{-1}$, $\frac{d\lambda}{dt}=y^T\frac{dA}{dt}x$ </li> <li> \[\lambda_{max}(S+T)\le \lambda_{max}(S) + \lambda_{max}(T)\] </li> </ul> \[\lambda_{min}(S+T) \ge \lambda_{min}(S) + \lambda_{min}(T)\] However, it is hard to find the intermediate ones.
<ul> <li>Weyl inequality: $\sigma_{i+j-1}(A+B)\le \sigma_i(A)+\sigma_j(B)$</li> </ul> </li> <li> Saddle points from Lagrange multipliers <ul> <li>Lagrangian: $L(x, \lambda) = \frac{1}{2}x^TSx + \lambda^T(Ax-b)$ which will produce a Lag</li> </ul> </li> <li>Application <ul> <li>Updating least squares when there is a new row</li> <li>Kalman filter (TODO: reading p. 112)</li> </ul> </li> </ul> </article> <article> <h1>A review on heterophily graphs</h1> 2022-01-28T00:00:00+00:00 <h1 id="a-review-on-heterophily-graph">A review on heterophily graph</h1>
It’s good to see you here, with my recent reading on heterophily graphs. In this blog, we try to make sense of the recent wave of papers on heterophily graphs. In this review, we study the following research questions.
<ul> <li>Why does a GNN work well on homophily graphs?</li> <li>What do heterophily graph datasets look like? How can heterophily be measured?</li> <li>What are the current solutions for heterophily graphs?</li> <li>What is the connection with other problems in Graph Neural Networks?</li> </ul>
<h2 id="how-does-gnn-work-well-on-the-homophily-graph">How does GNN work well on the homophily graph</h2>
To answer this question, we first distinguish GNNs from other existing Euclidean-based methods. The key difference between a GNN and other methods lies in two components of the GNN.
<ul> <li>Message passing mechanism: it captures the rich information in the neighborhood of an object.</li> <li>Aggregator: it selects and compresses the information from the ego node feature (the feature of the node itself) and the neighbor features.</li> </ul>
So what is the key difference when adding these operators? The answer is smoothness. Firstly, the GNN procedure can be viewed as a denoising procedure, from a noisy signal (original features) $S \in \mathbb{R}^{N \times d}$ to a clean signal (features after the GNN) $F \in \mathbb{R}^{N \times d}$: $\arg \min_F \mathcal{L} = ||F-S||^2_F + c \cdot tr(F^TLF)$ where $L = D - A$ is the Laplacian matrix.
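As a minimal sketch of this denoising view: setting the gradient of the objective to zero, $2(F-S) + 2cLF = 0$, gives the closed form $F = (I + cL)^{-1}S$. The toy path graph, the signal values and the helper names (`laplacian`, `denoise`, `smoothness`) below are hypothetical and only serve to show that the solution is smoother than the input.

```python
import numpy as np

def laplacian(A):
    # Unnormalized graph Laplacian L = D - A.
    return np.diag(A.sum(axis=1)) - A

def denoise(S, A, c):
    # Closed-form minimizer of ||F - S||_F^2 + c * tr(F^T L F):
    # F = (I + c L)^{-1} S.
    n = A.shape[0]
    return np.linalg.solve(np.eye(n) + c * laplacian(A), S)

def smoothness(F, A):
    # tr(F^T L F): small when neighboring nodes have similar features.
    return float(np.trace(F.T @ laplacian(A) @ F))

# Path graph 0 - 1 - 2 - 3 with a strongly oscillating one-dimensional signal.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
S = np.array([[1.0], [-1.0], [1.0], [-1.0]])

F = denoise(S, A, c=1.0)
before, after = smoothness(S, A), smoothness(F, A)
```

A larger $c$ puts more weight on the smoothness term and pulls neighboring rows of $F$ closer together.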
The first term is not hard to understand; it is common in learning good representations, for example in autoencoders. The optimization of the second term is more about the message passing mechanism. Considering the second term alone, the smoothness between neighbors is: $tr(F^TLF) = \frac{1}{2}\sum_{i\in V}\sum_{j\in N(i)} ||F_i - F_j||_2^2$ which measures the difference between node $i$ and all its neighbors. The minimization is similar to the well-known Courant-Fischer problem: $\lambda_1 = \min_F tr(F^TLF) = 0$ over unit-norm $F$, and the minimizer is the eigenvector corresponding to $\lambda_1$, which is the all-ones vector (up to scaling) and is extremely smooth. Then how does this smoothness help us learn a discriminative model? Under the homophily assumption, it reduces the noise within the same class. In other words, it reduces the intra-class variance for easier classification. Smoothness can also be viewed as noise (variance) reduction. Assume that the noise power of every node is the same, given by its variance $\sigma^2$. Then after aggregation, the new variance is $\sum_{v_j \in N_{v_i}}a_{i,j}^2 \cdot \sigma^2$, where $a_{i,j}$ is the aggregation weight. For example, <code class="language-plaintext highlighter-rouge">$a_{i,j} = \frac{1}{|N_{v_i}|}$</code> for the mean aggregator. This helps if the labels are also smooth, which means the aggregated label stays the same or almost the same as before. The label smoothness can be measured as $\lambda_l = \sum_{e_{v_i,v_j}\in \mathcal{E}} (1- \mathbb{I}(v_i \simeq v_j)) / |\mathcal{E}|$ where $\mathbb{I}(v_i \simeq v_j)$ is 1 when the two nodes share the same label. The feature smoothness can be measured as $\lambda_{f}=\frac{\left\|\sum_{v \in \mathcal{V}}\left(\sum_{v^{\prime} \in \mathcal{N}_{v}}\left(x_{v}-x_{v^{\prime}}\right)\right)^{2}\right\|_{1}}{|\mathcal{E}| \cdot d}$ If $\lambda_l$ is small and $\lambda_f$ is large, a GNN can work well and reduces the intra-class variance a lot. In any case, whenever $\lambda_l$ is small, a GNN helps to reduce the noise, giving smoother features and lower intra-class variance.
(This explanation still needs to be proved. Discussion is welcome.)
<h2 id="how-to-measure-the-heterphoily">How to measure the heterophily?</h2>
In this section, we ask what heterophily looks like, and why the original GCN cannot work well in some cases.
<h3 id="the-measurement-for-heterphoily">The measurement for heterophily</h3>
When we talk about measurement, in most cases it is a measurement on the labels. The metrics for homophily and heterophily are just opposite to each other: high homophily means low heterophily. Two basic measurements are:
<ul> <li>Edge homophily ratio: <code class="language-plaintext highlighter-rouge">$h=\frac{\left|\left\{(u, v):(u, v) \in \mathcal{E} \wedge y_{u}=y_{v}\right\}\right|}{|\mathcal{E}|}$</code></li> <li>Node homophily ratio: <code class="language-plaintext highlighter-rouge">$h= \frac{1}{|\mathcal{V}|} \sum_{v\in \mathcal{V}} \frac{\left|\left\{u \in \mathcal{N}_v : y_{u}=y_{v}\right\}\right|}{d_v}$</code></li> </ul>
Be careful about whether your graph is directed or undirected. This setting will definitely influence your performance! The measurements above are naive and intuitive, but they also have drawbacks, so some new measurements have been proposed.
<h4 id="compatibility-matrix">Compatibility matrix</h4>
What problem does it solve? The homophily level varies among different pairs of classes, so the measurement should be class-wise. The compatibility matrix is defined as follows: $\mathbf{H}=\left(\mathbf{Y}^{\top} \mathbf{A} \mathbf{Y}\right) \oslash\left(\mathbf{Y}^{\top} \mathbf{A} \mathbf{E}\right)$ Here $\mathbf{Y}\in \mathbb{R}^{|\mathcal{V}|\times |\mathcal{Y}|}$ is the class indicator matrix, $\mathbf{E}$ is a $|\mathcal{V}|\times |\mathcal{Y}|$ all-ones matrix, and $\oslash$ is Hadamard (element-wise) division. In the $\mathbf{H}$ matrix, the diagonal elements measure the homophily.
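The edge homophily ratio and the compatibility matrix can be computed directly from a dense adjacency matrix. The sketch below assumes an undirected graph stored as a symmetric 0/1 matrix, so every edge is counted in both directions, and integer labels starting at 0. The function names and the 4-node toy graph are hypothetical.

```python
import numpy as np

def edge_homophily(A, y):
    # Fraction of (directed) edge entries whose endpoints share a label.
    src, dst = np.nonzero(A)
    return float(np.mean(y[src] == y[dst]))

def compatibility_matrix(A, y, num_classes):
    # H = (Y^T A Y) ./ (Y^T A E): H[c, c'] is the fraction of edges leaving
    # class c that end in class c'. Assumes every class has at least one edge.
    Y = np.eye(num_classes)[y]   # |V| x |Y| one-hot class indicator matrix
    E = np.ones_like(Y)          # |V| x |Y| all-ones matrix
    return (Y.T @ A @ Y) / (Y.T @ A @ E)

# Path graph 0 - 1 - 2 - 3 with labels [0, 0, 1, 1]:
# two of the three edges connect nodes of the same class.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = np.array([0, 0, 1, 1])

h = edge_homophily(A, y)
H = compatibility_matrix(A, y, num_classes=2)
```

Each row of `H` sums to 1, and the diagonal gives the per-class homophily that the single number `h` averages away.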
<h4 id="class-wise-measurement">Class-wise measurement</h4> What problem does it solve? <ul> <li>The number of classes matters! Heterophily means neighbors have labels from different classes. However, "different" has a different meaning in a dataset with 6 classes and in one with 2 classes: it means a label from the other 5 classes or from the other 1 class, respectively.</li> <li>Class balance matters! For instance, if 99% of nodes were of one class, then most edges would likely be within that same class, so the edge homophily would be high.</li> </ul> To overcome these weaknesses, the class-wise homophily measurement is proposed as: $\hat{h}=\frac{1}{|C|} \sum_{k=1}^{|C|}\left[h_{k}-\frac{\left|C_{k}\right|}{n}\right]_{+}$ where $[a]_+=\max(a,0)$, $h_k$ is a class-wise metric, and the second term is the expected value for a randomly connected graph. $h_k = \frac{\sum_{v\in C_k}\left|\left\{(u, v): u \in \mathcal{N}_v \wedge y_{u}=y_{v}\right\}\right|}{\sum_{v\in C_k}d_v}$ <h4 id="aggregation-similarity-score">Aggregation Similarity score</h4> This is a new perspective from backpropagation: do homophilous or heterophilous nodes contribute more to the update direction? For a GNN without activation function, $Y = softmax(\hat{A}XW) = softmax(Y')$ \[\bigtriangleup Y' = \mathbf{\hat{A} X X^{T} \hat{A}^{T}}(Z-Y)=S(\hat{A}, X)(Z-Y)\] $Z-Y$ is the prediction error matrix, $Z$ is the ground truth matrix, and $S(\hat{A}, X)$ determines where the update goes. The aggregation similarity score, which determines whether the homophilous nodes or the heterophilous nodes contribute more, is then defined as: $S_{a g g}(S(\hat{A}, X))=\frac{\left|\left\{v \mid \operatorname{Mean}_{u}\left(\left\{S(\hat{A}, X)_{v, u} \mid Z_{u,:}=Z_{v,:}\right\}\right) \geq \operatorname{Mean}_{u}\left(\left\{S(\hat{A}, X)_{v, u} \mid Z_{u,:} \neq Z_{v,:}\right\}\right)\right\}\right|}{|\mathcal{V}|}$ The authors argue that this metric identifies the harmful heterophily and does not penalize useful heterophily. 
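The class-wise measurement $\hat{h}$ from this section can be sketched as follows. It follows the formula as written in this post (with the $\frac{1}{|C|}$ normalization); the function name is hypothetical, the graph is assumed undirected with each edge listed once, and a class without any edge endpoint is assumed to contribute $h_k = 0$.

```python
def class_adjusted_homophily(edges, labels, num_classes):
    """h_hat = mean over classes of max(h_k - |C_k| / n, 0)."""
    n = len(labels)
    same = [0] * num_classes    # same-label neighbor count, summed over nodes of class k
    degree = [0] * num_classes  # total degree of nodes of class k
    size = [0] * num_classes    # |C_k|
    for y in labels:
        size[y] += 1
    for u, v in edges:
        # Visit the undirected edge from both endpoints.
        for a, b in ((u, v), (v, u)):
            degree[labels[a]] += 1
            if labels[a] == labels[b]:
                same[labels[a]] += 1
    total = 0.0
    for k in range(num_classes):
        h_k = same[k] / degree[k] if degree[k] else 0.0
        total += max(h_k - size[k] / n, 0.0)
    return total / num_classes


if __name__ == "__main__":
    # Balanced path graph 0-1-2-3 with labels 0, 0, 1, 1.
    print(class_adjusted_homophily([(0, 1), (1, 2), (2, 3)], [0, 0, 1, 1], 2))
    # Every node in one class: edge homophily is 1, but h_hat is 0.
    print(class_adjusted_homophily([(0, 1), (1, 2)], [0, 0, 0], 2))
```

The second call shows the class-balance point from the list above: a graph dominated by one class gets a perfect edge homophily ratio, while the adjusted score gives it no credit.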
<h4 id="cross-class-neighborhood-similarity">Cross-class Neighborhood Similarity</h4> This metric is the average similarity between the neighborhoods of nodes from two classes: $s(c, c') = \frac{1}{\left|\mathcal{V}_{c}\right|\left|\mathcal{V}_{c^{\prime}}\right|} \sum_{i \in \mathcal{V}_{c}, j \in \mathcal{V}_{c^{\prime}}} \cos (d(i), d(j))$ where $d(i)$ describes the neighborhood of node $i$ (in Is Homophily a Necessity for Graph Neural Networks?, the histogram of its neighbors' labels). <h3 id="how-to-conduct-a-synthetic-hetephoily-from-homophoily-dataset">How to construct a synthetic heterophily dataset from a homophily dataset</h3> <h4 id="the-drawback-of-real-world-data">The drawbacks of real-world data</h4> The commonly used datasets were proposed by Geom-GCN: Geometric Graph Convolutional Networks. However, they still have several problems. <ul> <li>WebKB benchmarks: relatively small sizes</li> <li>Unreliable label assignment: Squirrel and Chameleon have class labels based on ranking of page traffic</li> <li>Unusual network structure: Squirrel and Chameleon are dense, with many nodes sharing the same neighbors</li> </ul> Also, good new large benchmark datasets have been proposed recently. 
Please refer to Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods. <h4 id="the-procedure-to-generate-a-dataset">The procedure to generate a dataset</h4> <ul> <li>The number of classes is prescribed.</li> <li>Starting from a small initial graph, new nodes are added to the graph one by one.</li> <li>The probability $p_{uv}$ for a newly added node $u$ in class $i$ to connect with an existing node $v$ in class $j$ is proportional to both the class compatibility $H_{ij}$ between class $i$ and $j$, and the degree $d_v$ of the existing node $v$.</li> <li>The degree distribution of the generated graphs follows a power law, and the heterophily can be controlled by the class compatibility matrix $H$.</li> <li>Node features are sampled from the corresponding class in a dataset like Cora.</li> </ul> The details can be found in Beyond Homophily in Graph Neural Networks: Current Limitations and Effective Designs. <h4 id="modify-a-dataset-from-dataset-of-the-existing-homophily-dataset">Modify an existing homophily dataset</h4> The key question is how to generate the heterophilous edges. The answer is to add synthetic, cross-label edges that connect nodes with different labels. In this procedure, $\gamma$ is the noise probability and $D_c, c \in \mathcal{C}$ is a predefined discrete neighborhood distribution for each class $c \in \mathcal{C}$. For example, the neighborhood distribution for class $0$ is: $D_0 = \mathrm{categorical}([0, 1/2, 0, 0, 0, 0, 1/2])$. This leads to quite interesting experimental results; please refer to the original paper. The detailed procedure can be found in Is Homophily a Necessity for Graph Neural Networks? <h4 id="csbm-generated-conduction">CSBM-generated graphs</h4> CSBM (contextual stochastic block model) is a generative model for random graphs. It has been widely used in graph clustering. 
It has the following features: <ul> <li>Node features are Gaussian random vectors, where the mean of the Gaussian depends on the community assignment.</li> <li>The difference of the means is controlled by a parameter $\mu$.</li> <li>The difference between the edge densities within communities and between communities is controlled by a parameter $\lambda$. $\lambda>0$ corresponds to homophilic graphs.</li> </ul> We will give a more detailed introduction to it later. <h2 id="what-is-the-current-solution-to-the-heterophily-graph">What is the current solution to the heterophily graph?</h2> I think this is the most exciting part of summarizing recent progress. In this section, we introduce the solutions from multiple perspectives. All of our perspectives still focus on this formulation: $arg \min_F \mathcal{L} = ||F-S||^2_F + c \cdot tr(F^TLF)$ The design space first focuses on the first term, which preserves the original feature. This matters because heterophilous neighbors may disturb it, so an adaptive self-loop is necessary. Then for the second term with the smoothness assumption, a question is raised: does smoothness hurt distinguishability? From the frequency perspective, the low-frequency signal may be good; however, in the optimal case $\lambda=0$, much information is lost in this procedure. The solutions are then: <ul> <li>Instead of node similarity, find the difference between neighbors, which can also be called the high-frequency signal</li> <li>Find more diverse neighbors to enlarge the receptive field for more information <ul> <li>Construct more adjacency matrices than $L$ for more useful signals, for example a KNN graph or a structure graph.</li> <li>Build a deeper GNN and select the useful neighbors. In other words, $tr(F^TLF)$ becomes $tr(F^T (\alpha_1 L^1+\alpha_2 L^2+\alpha_3 L^3)F)$. Maybe the third-order neighbors can help turn the graph into a homophilous one. 
In fact, oversmoothing and the heterophily problem are two sides of the same coin</li> </ul> </li> </ul> The following topics focus on these design spaces: <ul> <li>Keep the original feature and find differences with the neighborhood</li> <li>Find more useful adjacency matrices</li> <li>Deeper GNN for a larger receptive field</li> </ul> Some of the methods we introduce cover more than one of these perspectives. Unfortunately, some good models do not fit into this framework, including LINKX and CPGNN. Our discussion is based on the paper list at the end of this blog. <h3 id="keep-origin-feature">Keep the original feature</h3> First we introduce a term: node feature similarity, which is the opposite of neighbor smoothness. Concretely, the aggregation process of GNNs tries to preserve structural similarity, while it tends to destroy node similarity in the original feature space. To keep the original feature, a self-loop is of great necessity, along with other feature-preserving methods. <h4 id="simp-gcnnode-similarity-preserving-graph-convolutional-networks">SimP-GCN: Node Similarity Preserving Graph Convolutional Networks</h4> The solution of SimP-GCN consists of the following three components: <ul> <li>Feature similarity adjacency matrix</li> <li>Adaptive self-loop</li> <li>Feature-based SSL task</li> </ul> The two-channel propagation is based on two graphs, the adjacency matrix and the feature KNN matrix. $\mathbf{P}^{(l)}=\mathbf{s}^{(l)} * \tilde{\mathbf{D}}^{-1 / 2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1 / 2}+\left(1-\mathbf{s}^{(l)}\right) * \mathbf{D}_{f}^{-1 / 2} \mathbf{A}_{f} \mathbf{D}_{f}^{-1 / 2}$ where $\mathbf{P}^{(l)}$ is the propagation matrix, and the KNN adjacency $\mathbf{A}_f$ is defined by feature cosine similarity, selecting the top K most similar nodes as the neighborhood. 
The adaptive combination score is learned as: $\mathbf{s}^{(l)}=\sigma\left(\mathbf{H}^{(l-1)} \mathbf{W}_{s}^{(l)}+b_{s}^{(l)}\right)$ where $\mathbf{W}_{s}^{(l)} \in \mathbb{R}^{d^{(l-1)} \times 1}$. The adaptive self-loop is \[\tilde{\mathbf{P}}^{(l)}=\mathbf{P}^{(l)}+\gamma \mathbf{D}_{K}^{(l)}\] with $\mathbf{D}_{K}^{(l)} = diag(K_1^{(l)},K_2^{(l)}, \ldots, K_n^{(l)} )$, which is learned from \[K_{i}^{(l)}=\mathbf{H}_{i}^{(l-1)} \mathbf{W}_{K}^{(l)}+b_{K}^{(l)}\] where $\mathbf{W}_{K}^{(l)} \in \mathbb{R}^{d^{(l-1)} \times 1}$. The SSL task preserves the original node feature similarity: $\mathcal{L}_{\text {self }}(\mathbf{A}, \mathbf{X})=\frac{1}{|\mathcal{T}|} \sum_{\left(v_{i}, v_{j}\right) \in \mathcal{T}}\left\|f_{w}\left(\mathbf{H}_{i}^{(l)}-\mathbf{H}_{j}^{(l)}\right)-\mathbf{S}_{i j}\right\|^{2}$ where $\mathbf{S}_{ij}$ is the cosine feature similarity between nodes $i$ and $j$. The SSL task is a regression that learns the original feature similarity and dissimilarity. <h4 id="h2gcn-beyond-homophily-in-graph-neural-networks-current-limitations-and-effective-designs">H2GCN: Beyond Homophily in Graph Neural Networks: Current Limitations and Effective Designs</h4> In this paper, the authors argue that the self-loop may not be a good solution. They design a separate aggregation for the ego embedding and the neighbor embedding, and use a combination operation for these two embeddings. The original feature is better preserved. $\mathbf{r}_{v}^{(k)}=\operatorname{COMBINE}\left(\mathbf{r}_{v}^{(k-1)}, \operatorname{AGGR}\left(\left\{\mathbf{r}_{u}^{(k-1)}: u \in \bar{N}(v)\right\}\right)\right)$ The neighborhood $\bar{N}(v)$ does not include $v$. The combination may be followed by a non-linear transformation. They show theoretically that the combination generalizes better. 
The theoretical justification is as follows. Consider a graph G without self-loops with node features <code class="language-plaintext highlighter-rouge">$x_v = onehot(y_v)$</code> for each node <code class="language-plaintext highlighter-rouge">$v$</code>, and an equal number of nodes per class <code class="language-plaintext highlighter-rouge">$y ∈ Y$</code> in the training set <code class="language-plaintext highlighter-rouge">$\mathcal{T}_V$</code>. Also assume that all nodes in <code class="language-plaintext highlighter-rouge">$\mathcal{T}_V$</code> have degree d, and proportion h of their neighbors belong to the same class, while proportion <code class="language-plaintext highlighter-rouge">$\frac{1-h} {|Y|-1} $</code> of them belong to any other class (uniformly). Then for <code class="language-plaintext highlighter-rouge">$h < \frac{1-|Y|+2d}{ 2|Y|d}$</code>, a simple GCN layer formulated as <code class="language-plaintext highlighter-rouge">$(A + I)XW$</code> is less robust, i.e., misclassifies a node for smaller train/test data deviations, than a <code class="language-plaintext highlighter-rouge">$AXW $</code> layer that separates the ego- and neighbor-embeddings. TODO: give more intuition on the proof. The above two papers both focus on preserving the original feature. Notice that this component has been widely adopted by later methods, which use it as one channel. <h3 id="find-differences-with-neighborhood">Find differences with neighborhood</h3> In this part, there are two major families of methods, which take ideas from the spatial and the spectral perspective. 
The questions are similar: <ul> <li>What is the role of the low-frequency signal and the high-frequency signal?</li> <li>How can a high-pass filter reduce the harm from bad heterophily?</li> </ul> <h4 id="fagcnbeyond-low-frequency-information-in-graph-convolutional-network">FAGCN: Beyond Low-frequency Information in Graph Convolutional Network</h4> This paper argues that the low-frequency signal is not enough for learning good representations, as low frequency ignores the differences between nodes. A good GNN should be able to separate and capture both low-frequency and high-frequency signals, and then adaptively propagate low-frequency signals, high-frequency signals and raw features with a self-gating mechanism. The signals are extracted as: $\mathcal{F}_{L}=\varepsilon I+D^{-1 / 2} A D^{-1 / 2}=(\varepsilon+1) I-L$ \[\mathcal{F}_{H}=\varepsilon I - D^{-1 / 2} A D^{-1 / 2}=(\varepsilon-1) I+L\] where $\mathcal{F}_{L}$ amplifies the low-frequency signals and restrains the high-frequency signals, $\mathcal{F}_{H}$ does the opposite, and $\varepsilon$ is a hyperparameter that balances the two (it also acts as the self-loop weight, see below). The channels are mixed by a self-gate: $\tilde{\mathbf{h}}_{i}=\alpha_{i j}^{L}\left(\mathcal{F}_{L} \cdot \mathbf{H}\right)_{i}+\alpha_{i j}^{H}\left(\mathcal{F}_{H} \cdot \mathbf{H}\right)_{i}=\varepsilon \mathbf{h}_{i}+\sum_{j \in \mathcal{N}_{i}} \frac{\alpha_{i j}^{L}-\alpha_{i j}^{H}}{\sqrt{d_{i} d_{j}}} \mathbf{h}_{j}$ with $\alpha_{i j}^L+\alpha_{i j}^H = 1$ as the two parts of the signal. Their difference $\alpha_{i j}^{G}=\alpha_{i j}^{L}-\alpha_{i j}^{H}$ is learned from $\alpha_{i j}^{G}=\tanh \left(\mathbf{g}^{\top}\left[\mathbf{h}_{i} \| \mathbf{h}_{j}\right]\right)$ <h4 id="acm-gcn-is-heterophily-a-real-nightmare-for-graph-neural-networks-to-do-node-classification">ACM-GCN: Is Heterophily A Real Nightmare For Graph Neural Networks To Do Node Classification?</h4> This paper starts from the question of whether all heterophily is bad for a graph, and identifies the bad heterophily by analyzing backpropagation. The measurement can be found in the measurement part above. 
They address it with a diversification operation and propose the Adaptive Channel Mixing (ACM) framework. The intuition is to use a high-pass filter (the diversification operation) to extract the information of neighborhood differences and address harmful heterophily. The learning has three steps, similar to the former method. Feature extraction for each channel (extract the signal): $H_{L}^{l}=\operatorname{ReLU}\left(H_{\mathrm{LP}} H^{l-1} W_{L}^{l-1}\right), H_{H}^{l}=\operatorname{ReLU}\left(H_{\mathrm{HP}} H^{l-1} W_{H}^{l-1}\right), H_{I}^{l}=\operatorname{ReLU}\left(I H^{l-1} W_{I}^{l-1}\right)$ where $H_{\mathrm{LP}} = A$ and $H_{\mathrm{HP}} = L$. Feature-based weight learning (like the self-gate): \[\begin{array}{l} \tilde{\alpha}_{L}^{l}=\sigma\left(H_{L}^{l} \tilde{W}_{L}^{l}\right), \tilde{\alpha}_{H}^{l}=\sigma\left(H_{H}^{l} \tilde{W}_{H}^{l}\right), \tilde{\alpha}_{I}^{l}=\sigma\left(H_{I}^{l} \tilde{W}_{I}^{l}\right), \tilde{W}_{L}^{l}, \tilde{W}_{H}^{l}, \tilde{W}_{I}^{l} \in \mathbb{R}^{F_{l} \times 1} \\ {\left[\alpha_{L}^{l}, \alpha_{H}^{l}, \alpha_{I}^{l}\right]=\operatorname{Softmax}\left(\left[\tilde{\alpha}_{L}^{l}, \tilde{\alpha}_{H}^{l}, \tilde{\alpha}_{I}^{l}\right] W_{\text {Mix }}^{l} / T\right), W_{\text {Mix }}^{l} \in \mathbb{R}^{3 \times 3}, T \in \mathbb{R} \text { is the temperature; }} \end{array}\] Multi-channel aggregation: \[H^{l}=\left(\operatorname{diag}\left(\alpha_{L}^{l}\right) H_{L}^{l}+\operatorname{diag}\left(\alpha_{H}^{l}\right) H_{H}^{l}+\operatorname{diag}\left(\alpha_{I}^{l}\right) H_{I}^{l}\right)\] Notice that, going beyond FAGCN, this is a framework which can be applied to any GNN model. One remaining question: what is the difference between A L and L -L for these signals? <h3 id="find-more-useful-adjacent-matrix">Find more useful adjacency matrices</h3> As Geom-GCN says, the basic idea is that aggregation on a graph can benefit from a continuous space underlying the graph. 
Where does this useful continuous space lie? It lies in: <ul> <li>Graph embedding similarity adjacency matrix</li> <li>Structural similarity adjacency matrix (Struct2Vec)</li> <li>Original feature similarity adjacency matrix (SimP-GCN)</li> </ul> They help to capture long-range information, which alleviates two problems of deeper GNNs: <ul> <li>relevant messages from distant nodes are mixed indistinguishably with a large number of irrelevant messages from proximal nodes in multi-layer MPNNs, which cannot be extracted effectively</li> <li>representations of different nodes would become very similar in multi-layer MPNNs</li> </ul> <h4 id="geom-gcn--geometric-graph-convolutional-networks-graph-embedding-similarity">Geom-GCN: GEOMETRIC GRAPH CONVOLUTIONAL NETWORKS (Graph embedding similarity)</h4> In this paper, the authors point out that <ul> <li>Pooling and aggregation compress the structural information of nodes in neighborhoods, so multiple channels that preserve the neighborhood structure are necessary.</li> <li>Long-range dependencies can be captured in the geometric continuous latent space (a graph embedding technique is used to construct new information).</li> </ul> The structural neighborhood is constructed from different graph embedding methods like DeepWalk and IsoMap. The neighborhood is defined as $\mathcal{N}(v)=(\{N_{g}(v), N_{s}(v)\}, \tau)$, where $\tau$ is a geometric relation operator on the graph embedding and $N_{s}(v)=\left\{u \mid u \in V, d\left(\boldsymbol{z}_{u}, \boldsymbol{z}_{v}\right)<\rho\right\}$, with $\rho$ a predefined hyperparameter. The bi-level aggregation is a multi-channel aggregation: <ul> <li> Low-level aggregation (aggregate within one kind of measurement, i.e. one channel) $\boldsymbol{e}_{(i, r)}^{v, l+1}=p\left(\left\{\boldsymbol{h}_{u}^{l} \mid u \in N_{i}(v), \tau\left(\boldsymbol{z}_{v}, \boldsymbol{z}_{u}\right)=r\right\}\right), \forall i \in\{g, s\}, \forall r \in R$ $p$ is a permutation-invariant function like mean aggregation. 
</li> <li> High-level aggregation (aggregate the multi-channel representation) $\boldsymbol{m}_{v}^{l+1}=\underset{i \in\{g, s\}, r \in R}{q}\left(\left(\boldsymbol{e}_{(i, r)}^{v, l+1},(i, r)\right)\right)$ $q$ is a pooling operation </li> <li> Non-linear transform </li> </ul> <h4 id="wrgatbreaking-the-limit-of-graph-neural-networks-by-improving-the-assortativity-of-graphs-with-local-mixing-patterns-structural-similarity-">WRGAT: Breaking the Limit of Graph Neural Networks by Improving the Assortativity of Graphs with Local Mixing Patterns (Structural similarity)</h4> This paper first points out that global assortativity cannot capture the diversity of individual nodes, and proposes a new node-level measurement. The authors then find that nodes with low local assortativity are not learned well by GNNs, so they take structural proximity into consideration to help learning. Node-level assortativity: <ul> <li>The original measurement: random walk with stationary distribution</li> <li>New measurement: random walk with restart</li> </ul> Structural neighborhood: it is constructed similarly to Struct2Vec, and can generate multiple views of the graph when considering different hops of neighbors. The intuition is that two nodes are similar if they have the same degree, and they are more similar if their neighbors also have the same degree. The structural distance can be computed as: $f_{\tau}(g, h)=f_{\tau-1}(g, h)+\mathcal{D}\left(s\left(\mathcal{N}_{\tau}(g)\right), s\left(\mathcal{N}_{\tau}(h)\right)\right)$ where $s(\cdot)$ gives the ordered degree sequence of a node set and the distance $\mathcal{D}$ between two such sequences is computed by DTW (dynamic time warping). The edge weight is computed as $w_{\tau}(g, h)=e^{-f_{\tau}(g, h)}, \quad \tau=0,1, \ldots T$ where $\tau$ indexes the multi-hop neighborhoods. Multi-channel aggregation: $\boldsymbol{m}_{u}=\sum_{\tau \in \mathcal{R}} \sum_{v \in \mathcal{N}_{1}^{\tau}(u)} \boldsymbol{W}_{\tau} \boldsymbol{h}_{v} w_{\tau}(u, v) \alpha_{\tau}(u, v)$ where $\alpha$ is a GAT-like attention. 
$\boldsymbol{e}_{u, v}^{\tau}=a_{\tau}\left(\boldsymbol{W}_{\tau} \boldsymbol{h}_{u}, \boldsymbol{W}_{\tau} \boldsymbol{h}_{v}\right)=\boldsymbol{a}^{T}\left[\boldsymbol{W}_{\tau} \boldsymbol{h}_{u} \| \boldsymbol{W}_{\tau} \boldsymbol{h}_{v}\right]$ \[\alpha_{\tau}(u, v)=\frac{\exp \left(\operatorname{LeakyReLU}\left(e_{u, v}^{\tau}\right)\right)}{\sum_{x \in \mathcal{N}_{1}^{\tau}(u)} \exp \left(\operatorname{LeakyReLU}\left(e_{u, x}^{\tau}\right)\right)}\] <h4 id="query-attention-nl-mlpnon-local-graph-neural-networks">Query Attention (NL-MLP): Non-Local Graph Neural Networks</h4> This paper wants to go beyond local aggregation, which is harmful for disassortative graphs. It is necessary to have non-local aggregation that captures long-range dependencies (nodes with the same label are distant from each other) with an attention query. Local embedding: a local embedding, which can be a one-layer GNN or just an MLP, serves as a feature extractor. Attention-guided sorting: learn an attention score for each local embedding with a learnable query, $a_{v}=\operatorname{ATTEND}\left(c, z_{v}\right) \in \mathbb{R}, \forall v \in V$ where $c$ is a calibration vector that is randomly initialized and jointly learned during training. This procedure brings distant but similar nodes together. Non-local aggregation: the nodes can then be sorted by attention score to form a sequence, and a 1D convolution is used for neighbor feature extraction. <h3 id="deeper-gnn-for-larger-receptive-field">Deeper GNN for larger receptive field</h3> In the former section, we said that relevant messages from distant nodes are mixed indistinguishably with a large number of irrelevant messages from proximal nodes in multi-layer MPNNs, which cannot be extracted effectively. So is there any method that can directly use the distant nodes without mixing them up so much? After all, the former section, which builds more useful adjacency matrices, relies on many assumptions and will certainly lose some information. 
The question is how to build a deeper GNN that avoids oversmoothing and captures more of the important nodes. Here we only talk about two typical methods: H2GCN and GPR-GNN. Methods like APPNP and GCNII will be discussed in my next blog. <h4 id="h2gcn-beyond-homophily-in-graph-neural-networks-current-limitations-and-effective-designs-1">H2GCN: Beyond Homophily in Graph Neural Networks: Current Limitations and Effective Designs</h4> We introduced this method above; here we introduce two more key components of it. Higher-order Neighborhoods: this design is similar to GCN-Cheby and MixHop, which is: $\mathbf{r}_{v}^{(k)}=\operatorname{COMBINE}\left(\mathbf{r}_{v}^{(k-1)}, \operatorname{AGGR}\left(\left\{\mathbf{r}_{u}^{(k-1)}: u \in N_{1}(v)\right\}\right), \operatorname{AGGR}\left(\left\{\mathbf{r}_{u}^{(k-1)}: u \in N_{2}(v)\right\}\right), \ldots\right)$ The key idea is that the one-hop neighborhood may be heterophilous while the two-hop neighborhood is homophilous, so the model should simply find the useful one. The theoretical result is as follows. Theorem 2: Consider a graph <code class="language-plaintext highlighter-rouge">$\mathcal{G}$</code> without self-loops with label set <code class="language-plaintext highlighter-rouge">$\mathcal{Y}$</code>, where for each node v, its neighbors’ class labels <code class="language-plaintext highlighter-rouge">$\left\{y_{u}: u \in N(v)\right\}$</code> are conditionally independent given <code class="language-plaintext highlighter-rouge">$y_{v}$</code>, and <code class="language-plaintext highlighter-rouge">$P\left(y_{u}=y_{v} \mid y_{v}\right)=h$</code>, <code class="language-plaintext highlighter-rouge">$P\left(y_{u}=y \mid y_{v}\right)=\frac{1-h}{|\mathcal{Y}|-1}, \forall y \neq y_{v} $</code>. Then, the 2-hop neighborhood <code class="language-plaintext highlighter-rouge">$N_{2}(v)$</code> for a node <code class="language-plaintext highlighter-rouge">$v$</code> will always be homophily-dominant in expectation. 
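To give some intuition for the theorem above, here is a short worked calculation. It is a simplified sketch rather than the proof from the paper: it assumes that the label of a 2-hop neighbor is obtained by composing two independent 1-hop steps, so the 2-hop class compatibility is $H^2$, and it introduces the shorthand $k = |\mathcal{Y}|$, $a = h$ and $b = \frac{1-h}{k-1}$.

```latex
% Assumed shorthand: k = |\mathcal{Y}|, a = h, b = (1-h)/(k-1).
% H is the k x k compatibility matrix with a on the diagonal and b elsewhere.
\[
H = (a-b)\,I + b\,\mathbf{1}\mathbf{1}^{\top}, \qquad
H^{2} = (a-b)^{2}\,I + \bigl(2b(a-b) + k\,b^{2}\bigr)\,\mathbf{1}\mathbf{1}^{\top}.
\]
% The gap between a diagonal and an off-diagonal entry of H^2 is a square:
\[
[H^{2}]_{ii} - [H^{2}]_{ij} = (a-b)^{2}
  = \Bigl(h - \tfrac{1-h}{k-1}\Bigr)^{2} \;\ge\; 0 \qquad (i \neq j).
\]
```

So under this assumption the same-class entry of the 2-hop compatibility is never smaller than any other-class entry, whatever the value of $h$. For example, with $k=2$ and $h=0$ (every edge connects the two classes), $H^2 = I$ and the 2-hop neighborhood is perfectly homophilous.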
Combination of Intermediate representations: this design is similar to the paper we introduce next, GPR-GNN, and also to JKNet. The key is to select the most informative layer representations, adapting to the structural properties of different neighborhood ranges. $\mathbf{r}_{v}^{(\text {final })}=\operatorname{COMBINE}\left(\mathbf{r}_{v}^{(1)}, \mathbf{r}_{v}^{(2)}, \ldots, \mathbf{r}_{v}^{(K)}\right)$ However, this combination is somewhat naive; the next paper goes into detail on how to do this layer selection. <h4 id="gprgnn-adaptive-universal-generalized-pagerank-graph-neural-network">GPRGNN: ADAPTIVE UNIVERSAL GENERALIZED PAGERANK GRAPH NEURAL NETWORK</h4> This paper builds a GNN with Generalized PageRank (GPR) that can adaptively learn different node label patterns. In this way, it can deal not only with plain homophily but also with complex heterophily patterns. The main approach is that GPR learns a weight for each layer, where each layer is like one step of a random walk. The framework is just like a layer selection: the GPR at some natural number $K$, $\sum_{k=0}^{K} \gamma_{k} \tilde{\mathbf{A}}_{\mathrm{sym}}^{k}$, actually corresponds to learning the optimal polynomial graph filter. I will revisit the PageRank algorithm later for a detailed discussion. We can find that heterophily needs more steps to become stable, which means large-step propagation is indeed of great importance for heterophilic graphs. The model structure is $\hat{\mathbf{P}}=\operatorname{softmax}(\mathbf{Z}), \mathbf{Z}=\sum_{k=0}^{K} \gamma_{k} \mathbf{H}^{(k)}, \mathbf{H}^{(k)}=\tilde{\mathbf{A}}_{\mathrm{sym}} \mathbf{H}^{(k-1)}, \mathbf{H}_{i:}^{(0)}=f_{\theta}\left(\mathbf{X}_{i:}\right)$ Notice that the propagation uses no non-linear activation function, and each hidden state corresponds to one PageRank step. GPR-GNN has the ability to adaptively control the contribution of each propagation step and adjust it to the node label pattern. 
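The propagation $\mathbf{Z}=\sum_{k=0}^{K} \gamma_{k} \tilde{\mathbf{A}}_{\mathrm{sym}}^{k} \mathbf{H}^{(0)}$ can be sketched in plain Python as below. This is only an illustration, not the authors' implementation: the weights $\gamma_k$ are fixed by hand instead of being learned, $\mathbf{H}^{(0)}$ is passed in directly instead of coming from $f_\theta$, $\tilde{\mathbf{A}}_{\mathrm{sym}}$ is assumed to be the symmetrically normalized adjacency with self-loops, and all function names are hypothetical.

```python
def sym_norm_adj(edges, n):
    """Dense D^{-1/2} (A + I) D^{-1/2} for an undirected edge list on n nodes."""
    adj = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1.0
    for i in range(n):
        adj[i][i] = 1.0  # self-loop
    deg = [sum(row) for row in adj]
    return [[adj[i][j] / (deg[i] * deg[j]) ** 0.5 for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gpr_propagate(a_sym, h0, gammas):
    """Z = sum_k gammas[k] * A_sym^k @ h0, computed step by step."""
    z = [[gammas[0] * x for x in row] for row in h0]
    h = h0
    for gamma in gammas[1:]:
        h = matmul(a_sym, h)  # one more propagation step
        z = [[zi + gamma * hi for zi, hi in zip(zr, hr)] for zr, hr in zip(z, h)]
    return z


if __name__ == "__main__":
    a_sym = sym_norm_adj([(0, 1)], n=2)
    h0 = [[1.0], [3.0]]
    # Positive weights average the neighbors (low-pass behaviour).
    print(gpr_propagate(a_sym, h0, [0.0, 1.0]))
    # A negative weight keeps the difference to the neighborhood (high-pass behaviour).
    print(gpr_propagate(a_sym, h0, [1.0, -1.0]))
```

Allowing the weights to take either sign is what lets the same model act as a low-pass filter on homophilous graphs and as a high-pass filter on heterophilous ones.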
<h2 id="what-is-the-correlation-with-other-problems-in-graph">What is the correlation with other problems in graph</h2> TODO: a summary of Is Homophily a Necessity and Two Sides of the Same Coin <h2 id="questions">Questions</h2> <ul> <li>What is the relationship between mixup learning and heterophily graphs?</li> <li>There still seems to be no method to judge, without labels, whether a graph is homophilous or not. Is a unified method of great need?</li> <li>Are dissimilar features really a drawback of heterophily graphs, or of GNNs?</li> </ul> <h1 id="paperlist">PaperList</h1> <ul> <li>Powerful Graph Convolutional Networks with Adaptive Propagation Mechanism for Homophily and Heterophily (AAAI2022)</li> <li>GBK-GNN: Gated Bi-Kernel Graph Neural Networks for Modeling Both Homophily and Heterophily (WWW2022)</li> <li>Graph Neural Networks Inspired by Classical Iterative Algorithms (ICML2021)</li> <li>Beyond Low-frequency Information in Graph Convolutional Networks (AAAI2021)</li> <li>ADAPTIVE UNIVERSAL GENERALIZED PAGERANK GRAPH NEURAL NETWORK (ICLR2021)</li> <li>Breaking the Limit of Graph Neural Networks by Improving the Assortativity of Graphs with Local Mixing Patterns (KDD2021)</li> <li>Is Heterophily A Real Nightmare For Graph Neural Networks To Do Node Classification? (Preprint)</li> <li>Is Homophily a Necessity for Graph Neural Networks? 
(ICLR2022)</li> <li>TWO SIDES OF THE SAME COIN: HETEROPHILY AND OVERSMOOTHING IN GRAPH CONVOLUTIONAL NEURAL NETWORKS (Preprint)</li> <li>Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods (NeurIPS2021)</li> <li>New Benchmarks for Learning on Non-Homophilous Graphs (WWW2021)</li> <li>Node Similarity Preserving Graph Convolutional Networks (WSDM2021)</li> <li>Simple and Deep Graph Convolutional Networks (ICML2020)</li> <li>Non-Local Graph Neural Networks (TPAMI)</li> <li>Graph Neural Networks with Heterophily (AAAI2020)</li> <li>Beyond Homophily in Graph Neural Networks: Current Limitations and Effective Designs (NeurIPS2020)</li> <li>GEOM-GCN: GEOMETRIC GRAPH CONVOLUTIONAL NETWORKS (ICLR2020)</li> <li>MEASURING AND IMPROVING THE USE OF GRAPH INFORMATION IN GRAPH NEURAL NETWORKS (ICLR2020)</li> </ul> <h1 id="useful-sentences">Useful sentences</h1> However, the typical GCN design that mixes the embeddings through an average [17] or weighted average [36] as the COMBINE function results in final embeddings that are similar across neighboring nodes (especially within a community or cluster) for any set of original features [28]. While this may work well in the case of homophily, where neighbors likely belong to the same cluster and class, it poses severe challenges in the case of heterophily: it is not possible to distinguish neighbors from different classes based on the (similar) learned representations. The aggregators lack the ability to capture long-range dependencies in disassortative graphs. In MPNNs, the neighborhood is defined as the set of all neighbors one hop away (e.g., GCN), or all neighbors up to r hops away (e.g., ChebNet). In other words, only messages from nearby nodes are aggregated. The MPNNs with such aggregation are inclined to learn similar representations for proximal nodes in a graph. 
This implies that they are probably desirable methods for assortative graphs (e.g., citation networks (Kipf & Welling, 2017) and community networks (Chen et al., 2019)), where node homophily holds (i.e., similar nodes are more likely to be proximal, and vice versa), but may be inappropriate to the disassortative graphs (Newman, 2002) where node homophily does not hold. For example, Ribeiro et al. (2017) shows disassortative graphs where nodes of the same class exhibit high structural similarity but are far apart from each other. In such cases, the representation ability of MPNNs may be limited significantly, since they cannot capture the important features from distant but informative nodes. The homophily principle (McPherson et al., 2001) in the context of node classification asserts that nodes from the same class tend to form edges. Homophily is also a common assumption in graph clustering (Von Luxburg, 2007; Tsourakakis, 2015; Dau & Milenkovic, 2017) and in many GNNs design (Klicpera et al., 2018). Methods developed for homophilic graphs are nonuniversal in so far that they fail to properly solve learning problems on heterophilic (disassortative) graphs (Pei et al., 2019; Bojchevski et al., 2019; 2020). In heterophilic graphs, nodes with distinct labels are more likely to link together (For example, many people tend to preferentially connect with people of the opposite sex in dating graphs, different classes of amino acids are more likely to connect within many protein structures (Zhu et al., 2020) etc). GNNs model the homophily principle by aggregating node features within graph neighborhoods. For this purpose, they use different mechanisms such as averaging in each network layer. 
Neighborhood aggregation is problematic and significantly more difficult for heterophilic graphs (Jia & Benson, 2020) </article> <article> <h1>Source Free Unsupervised Graph Domain Adaptation</h1> 2021-12-09T00:00:00+00:00 <h1 id="source-free-unsupervised-graph-domain-adaptation">Source Free Unsupervised Graph Domain Adaptation</h1> <h3 id="content">Content</h3> In this blog, we will not only talk about our paper, but also give a brief introduction to domain adaptation for those who are not familiar with it. If you are familiar with DA, it’s ok to start from section 4. <ol> <li>What is Domain Adaptation and its practical use?</li> <li>The traditional methods in domain adaptation.</li> <li>Brief introduction to domain adaptation methods in CV.</li> <li>Why do we need source free unsupervised graph domain adaptation? The key challenge and our solution.</li> <li>Experiments</li> <li>Conclusion and future work</li> </ol> <h3 id="what-is-domain-adaptation-and-its-practice-use">What is Domain Adaptation and its practical use?</h3> To give you a more precise description, here is the definition from Wikipedia. <blockquote> Domain adaptation(DA) is a field associated with machine learning and transfer learning. This scenario arises when we aim at learning from a source data distribution a well performing model on a different (but related) target data distribution. For instance, one of the tasks of the common spam filtering problem consists in adapting a model from one user (the source distribution) to a new user who receives significantly different emails (the target distribution). Domain adaptation has also been shown to be beneficial for learning unrelated sources. </blockquote> The most important characteristics of DA are as follows: <ul> <li>The source domain is well-labeled and the target domain is unlabeled</li> <li>Difference: There exists a domain gap between the source and target domain. 
<ul> <li>Covariate shift (marginal distribution shift): the most common domain gap, where <code class="language-plaintext highlighter-rouge">$ P(X_s) \ne P(X_t) $</code> and <code class="language-plaintext highlighter-rouge">$ P(Y|X_s) = P(Y|X_t) $</code>. This is the prior knowledge we assume in the proof of this paper.</li> <li>Target shift (conditional distribution shift): <code class="language-plaintext highlighter-rouge">$ P(X_s) = P(X_t) $</code> and <code class="language-plaintext highlighter-rouge">$ P(Y|X_s) \ne P(Y|X_t)$</code></li> <li>Joint distribution shift: <code class="language-plaintext highlighter-rouge">$P(X_s) \ne P(X_t)$</code> and <code class="language-plaintext highlighter-rouge">$P(Y|X_s) \ne P(Y|X_t)$</code>, which is the most difficult case since it makes no assumption.</li> </ul> </li> <li>Related: The tasks are related, which means the labels in the source domain and the labels in the target domain are the same. Here we need to clarify how DA differs from the recently popular pretraining paradigm (specifically on graphs). The key differences are: <ul> <li>Graph pretraining methods first use unlabeled graphs to learn a good representation for different tasks, and then use the labeled data of the specific downstream task to finetune the parameters. The key of pretraining is to enhance a model trained with labeled data by leveraging additional knowledge from unlabeled data.</li> <li>Methods in the proposed DA setting first use the labeled source graph to learn a discriminative model and then adapt it to the unlabeled target graph. With no labels in the target domain, it is a harder problem than pretraining. 
The key of DA is to solve a classification problem on unlabeled data by leveraging the relation between node features and labels learned from labeled source data.</li> <li>The challenges and required techniques are entirely different for these two scenarios.</li> </ul> </li> </ul> To make the difference concrete, we introduce some real-world datasets so that you can fully understand what the domain gap is. One thing worth mentioning is that we have not found any mathematical metric to evaluate the domain gap; most domain gaps are identified from perceptual knowledge. Take an example in CV (the VisDA-C dataset). Our target is to recognize planes in the real world. However, it is hard to manually collect so many images (from different angles) and label them. In contrast, it is quite easy to generate synthetic planes with computer graphics. Here comes the domain gap! The real images and the synthetic images are quite different, but they all show planes, so we can say that they are two different but related tasks. We will not give more examples in CV, such as changes of angle or different backgrounds. This is a research topic of great practical value. Next, we take some examples on graphs to see this phenomenon in the graph domain and to stress the scenarios of practical use. The first scenario happens when a new platform is built or new data comes in. ACM and DBLP are well-labeled citation network databases. Suppose we would like to develop a new database that might not be labeled yet, such as Aminer, based on the resources in the existing databases. The domain gap exists because different platforms have their own interests. Methods in the UGDA scenario could help the transfer in an unsupervised manner. The second scenario happens within the ACM platform, which is a large platform of papers. We found that the domain gap exists even between subgraphs with different structures, even though the features come from the same distribution. 
For example, consider a dense graph with 1500 nodes and 4960 edges and a sparse graph that also has 1500 nodes but only 759 edges. We find that a model trained on one graph can hardly achieve good performance on the other. A similar observation can also be found in <a href="https://arxiv.org/pdf/2006.15643.pdf">Investigating and Mitigating Degree-Related Biases in Graph Convolutional Networks</a>. Methods in UGDA can help to enhance the performance on label-scarce subgraphs. <h3 id="the-traditional-methods-in-domain-adaptation">The traditional methods in domain adaptation.</h3> With the above goal of mitigating the domain gap in mind, we first present some traditional methods and definitions, as they still contain insights we can learn from. There are two major families of methods in our discussion: <ul> <li>Instance-based transfer learning approaches</li> <li>Feature-based transfer learning approaches</li> </ul> <h4 id="instance-based-transfer-learning-approaches">Instance-based transfer learning approaches</h4> The intuition of the instance-based methods is that we can identify how important a source example is under the feature distribution of the target domain. Problem setting: given <code class="language-plaintext highlighter-rouge">$D_S= \left \{ x_{S_i}, y_{S_i} \right \}^{n_S}_{i=1}$, $D_T=\left \{ x_{T_i} \right \}^{n_T}_{i=1}$</code>, the goal is to learn $f_T$ such that $\sum_i \epsilon (f_T(x_{T_i}),y_{T_i})$ is small, where $y_{T_i}$ is unknown. 
The assumptions are: <code class="language-plaintext highlighter-rouge">$ \mathcal{Y}_{S}=\mathcal{Y}_{T},$</code> and <code class="language-plaintext highlighter-rouge">$P(Y_S|X_S)=P(Y_T|X_T) $</code> <code class="language-plaintext highlighter-rouge">$\mathcal{X}_{S}=\mathcal{X}_{T},$</code> $P(X_S)\ne P(X_T)$ So the solution will be: $\theta^* = \arg\min_{\theta} \mathbb{E}_{(x,y)\sim P_T}[l(x,y,\theta)]$ \[= \arg\min_{\theta} \mathbb{E}_{(x,y)\sim P_T} \left [ \frac{P_S(x,y)}{P_S(x,y)} l(x,y,\theta) \right ]\] \[= \arg\min_{\theta} \int_{y}\int_{x}P_T(x,y) \left ( \frac{P_S(x,y)}{P_S(x,y)} l(x,y,\theta) \right ) dxdy\] \[= \arg\min_{\theta} \int_{y}\int_{x}P_S(x,y) \left ( \frac{P_T(x,y)}{P_S(x,y)} l(x,y,\theta) \right ) dxdy\] \[= \arg\min_{\theta} \mathbb{E}_{(x,y)\sim P_S} \left [ \frac{P_T(x,y)}{P_S(x,y)} l(x,y,\theta) \right ]\] Since $P(Y_S|X_S)=P(Y_T|X_T)$, the ratio reduces to $\frac{P_T(x,y)}{P_S(x,y)}=\frac{P_T(x)}{P_S(x)}$. Denote $\beta(x)=\frac{P_T(x)}{P_S(x)}$, which can be viewed as a weight on each instance that describes how likely the source example is to appear in the target domain; reweighting by it reduces the domain shift. <h4 id="feature-based-transfer-learning-approaches">Feature-based Transfer Learning Approaches</h4> The intuition of the feature-based methods is that the source and target domains have some overlapping features, while other features only have support in either the source or the target domain. The feature-based methods aim to learn a mapping function $ \varphi$ that enhances the overlapping features. In other words, they stress the importance of the overlapping features and ignore the domain-specific features. An example is the following image. Feature-based methods are the most commonly used in DA, so we will discuss some typical methods: <ul> <li>Feature Augmentation (FAM model)</li> <li>Transfer Component Analysis (TCA model)</li> </ul> <h5 id="fam-model">FAM model</h5> FAM is just a feature augmentation. 
The mathematical formulation of the model is: $\Phi_S(x_i^S)=\left \langle x_i^S, x_i^S, 0 \right \rangle , \Phi_T(x_i^T)=\left \langle x_i^T, 0, x_i^T \right \rangle$ The key steps are: <ul> <li>Replicate the features three times: general features, source features, target features</li> <li>For the transformation of the source domain: target features are set to 0</li> <li>For the transformation of the target domain: source features are set to 0</li> </ul> The hope is that the weights corresponding to the general features in the first part learn the overlapping features while the others learn something domain specific. However, I cannot totally agree with this method, since most features are low-rank and there exist too many solutions. <h5 id="transfer-component-analysis">Transfer Component Analysis</h5> Its intuition is similar to PCA. However, PCA aims to learn a low-dimensional representation that best preserves the information in the features, while TCA aims to learn a low-dimensional representation that best preserves the features that are similar across the two domains. It is a little hard for me to explain it well intuitively; if anything is unclear, contact me or read the whole paper. <a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5640675">link</a> Maximum Mean Discrepancy (MMD): We know that $X^S= \{ x_i^s \}$ and $X^T= \{ x_i^t \}$ come from different feature distributions. How can we use the samples to estimate the distance between the distributions? Unlike the KL divergence or other methods, which need some hyperparameters or density estimation, MMD is based on the reproducing kernel Hilbert space (RKHS) and needs no such parameter. $Dist(X^S,X^T)=MMD(X^S,X^T)=\left \| \frac{1}{n_s}\sum_{i=1}^{n_s}\phi (x^s_i) - \frac{1}{n_t}\sum_{i=1}^{n_t}\phi (x^t_i) \right \| ^ 2 _ {\mathcal{H} }$ Here $\phi$ is the feature map induced by the kernel (similar to the kernel in SVMs): the samples are projected into the RKHS and the distance between their mean embeddings is calculated. 
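To make the MMD formula above more tangible, here is a minimal NumPy sketch of the (biased) empirical estimate. Expanding the squared RKHS norm gives three kernel averages, so $\phi$ never has to be computed explicitly. The RBF kernel, the bandwidth `gamma`, and the function names `rbf_kernel` and `mmd2` are assumptions chosen for illustration, not part of the original methods.

```python
import numpy as np


def rbf_kernel(a, b, gamma=1.0):
    """RBF kernel matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2) (assumed kernel)."""
    sq_dist = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating point error.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def mmd2(xs, xt, gamma=1.0):
    """Biased empirical squared MMD between source samples xs and target samples xt.

    || mean(phi(xs)) - mean(phi(xt)) ||^2
        = mean(K_ss) + mean(K_tt) - 2 * mean(K_st)
    """
    k_ss = rbf_kernel(xs, xs, gamma).mean()
    k_tt = rbf_kernel(xt, xt, gamma).mean()
    k_st = rbf_kernel(xs, xt, gamma).mean()
    return k_ss + k_tt - 2.0 * k_st
```

Calling `mmd2(xs, xt)` on two feature matrices of shape `(n_s, d)` and `(n_t, d)` returns a scalar that grows as the two sample sets drift apart, which is what makes it usable as a domain distance loss.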
Notice that the kernel of TCA is chosen manually. The $\phi$ should satisfy: <ul> <li>The distance between the marginal distributions $ P(\phi(X_S))$ and $P(\phi(X_T))$ is small.</li> <li>$\phi(X_S)$ and $\phi(X_T)$ preserve important properties of $X_S$ and $X_T$</li> </ul> In short, it should minimize the distance between the two distributions while keeping their important properties. Hilbert-Schmidt Independence Criterion (HSIC): Beyond a distance metric, we also need a measure of dependence. HSIC computes the Hilbert-Schmidt norm of a cross-covariance operator in the RKHS. The more dependent a feature is, the more necessary it is to preserve it. $HSIC(X,Y) = \frac{1}{(n-1)^2} tr(HKHK_{tt})$ where $K$, $K_{tt}$ are the kernel matrices in the source and target domain, and $H=I-\frac{1}{n}11^T$ is a centering matrix. If HSIC is equal to 0, the two are independent. A larger value means the features are more dependent. Because HSIC preserves information well, it is often used as a criterion in dimensionality reduction. It is often desirable to preserve the local data geometry while aligning with the side information. In our case, the side information is the unlabeled data in the target domain. Mathematically, the problem will be: $\max_{K \succeq 0} tr(HKHK_{tt})$ s.t. constraints on $K$, where $K_{tt}$ is computed from the information of the target domain. MMDE: We revisit a traditional method to generate embeddings. Similar to PCA, we first need to compute the similarity between each pair of instances. However, we give different weights to nodes from different domains. The Gram matrix is defined as: $K = \begin{pmatrix} K_{S,S} & K_{S,T}\\ K_{T,S} & K_{T,T} \end{pmatrix} \in \mathbb{R}^{(n_s+n_t)\times(n_s+n_t)}$ where $K=\left [ \phi(x_i)^T \phi(x_j) \right ]$, $\phi$ is defined on all the data, and the MMD distance can be written as $tr(KL)$. $L_{ij}=\frac{1}{n_s^2}$ if $x_i, x_j \in X_S$. 
$L_{ij}=\frac{1}{n_t^2}$ if $x_i, x_j \in X_T$, and $L_{ij}=-\frac{1}{n_sn_t}$ otherwise. The solution can then be written with the constraint on $K$: $\min_{K \succeq 0} tr(KL)-\lambda tr(K)$ where the first term is the objective that minimizes the distance between the distributions, while the second maximizes the variance in the feature space. TCA: To avoid solving for this large and dense matrix and to handle new samples in an inductive way, TCA is proposed. <ol> <li>Decompose the kernel matrix $K$ into $K=(KK^{-\frac{1}{2}})(K^{-\frac{1}{2}}K)$, a dual form.</li> <li>A matrix factorization is then applied: $\tilde{K}=(KK^{-\frac{1}{2}}\tilde{W})(\tilde{W}^TK^{-\frac{1}{2}}K)=KWW^TK$, where $W=K^{-\frac{1}{2}}\tilde{W}$. The operation is simplified by the kernel trick.</li> <li>As the optimization objective is the trace of a matrix, it can be rewritten using the cyclic property of the trace: $tr((KWW^TK)L)=tr(W^TKLKW)$</li> </ol> Then the optimization objective is: $\min_W tr(W^TKLKW) + \mu tr(W^TW)$ s.t. $W^TKHKW=I_m$ The first term minimizes the discrepancy between the distributions, and the second term is a regularizer that controls the complexity of $W$. The constraint fixes the variance of the projected samples, which preserves the diversity in the data. Simplifying this problem further, the solution $W$ is given by the $m$ leading eigenvectors of the matrix $(KLK+\mu I)^{-1}KHK$. <h5 id="a-simple-conclusion">A simple conclusion</h5> From the empirical perspective, I think the reasons for $P(X_S) \ne P(X_T)$ come from two perspectives. <ul> <li>The difference in sample probabilities. One class may appear many times in the source domain and seldom appear in the target domain. In this case, instance-based methods may help.</li> <li>The difference in feature distributions, like the above example of real planes and plane models. In this case, feature-based methods may help.</li> </ul> I think it is really worthwhile to learn these traditional methods. I am still thinking about how they can help our understanding today. 
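As a rough illustration of the TCA recipe in this section, the following NumPy sketch builds $K$, $L$ and $H$ and takes the $m$ leading eigenvectors of $(KLK+\mu I)^{-1}KHK$. It is only a sketch under stated assumptions: a linear kernel is used for brevity, the function name `tca_embed` and the default values of `m` and `mu` are hypothetical, and the eigenvectors are normalized with respect to $KLK+\mu I$ rather than rescaled to satisfy $W^TKHKW=I_m$ exactly (only the directions are meant to match).

```python
import numpy as np


def tca_embed(xs, xt, m=2, mu=1.0):
    """Sketch of TCA with a linear kernel (assumption); returns (n_s + n_t, m) embeddings."""
    ns, nt = len(xs), len(xt)
    n = ns + nt
    x = np.vstack([xs, xt])

    # Gram matrix over source and target samples (linear kernel for simplicity).
    k = x @ x.T

    # MMD matrix L: 1/ns^2 (source pairs), 1/nt^2 (target pairs), -1/(ns*nt) (cross pairs).
    e = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
    l_mat = np.outer(e, e)

    # Centering matrix H = I - (1/n) 1 1^T.
    h = np.eye(n) - np.ones((n, n)) / n

    a = k @ h @ k                       # K H K (variance term)
    b = k @ l_mat @ k + mu * np.eye(n)  # K L K + mu I (positive definite for mu > 0)

    # Leading eigenvectors of b^{-1} a, computed as a symmetric problem via Cholesky:
    # with b = C C^T and v = C^T w, a w = lam * b w becomes (C^{-1} a C^{-T}) v = lam * v.
    c_inv = np.linalg.inv(np.linalg.cholesky(b))
    sym = c_inv @ a @ c_inv.T
    sym = (sym + sym.T) / 2.0           # remove tiny asymmetries from rounding
    vals, vecs = np.linalg.eigh(sym)
    top = vecs[:, np.argsort(vals)[::-1][:m]]
    w = c_inv.T @ top

    # Project every sample onto the m transfer components.
    return k @ w
```

The first `n_s` rows of the result are the source embeddings and the remaining rows are the target embeddings; a classifier trained on the former can then be applied to the latter.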
<h3 id="brief-introduction-on-domain-adaptation-methods-in-cv">Brief introduction on domain adaptation methods in CV.</h3> In the above section, we introduced how to mitigate the domain gap mainly from the dimensionality reduction perspective. However, with the arrival of the deep learning era, rigorous mathematical proofs and careful designs seem less necessary, since a loss term can help us do most of the things we want. (We do not introduce graph domain adaptation methods in detail here, since they share the same spirit.) After reading a large number of papers, we summarize these methods from three perspectives: <ul> <li>Feature-level transfer methods (Feature-based transfer)</li> <li>Label-level transfer methods (instance-based transfer)</li> <li>Model-level transfer methods</li> </ul> To clarify our problem again: <ul> <li>Inputs are the source features, the source labels and the target features</li> <li>Output should be a model that performs well on the target domain</li> </ul> Feature-level transfer methods: The feature-level methods usually have a joint training framework with a shared encoder and a classifier (possibly with some additional model components). <ul> <li>A feature encoder is trained to align the feature distributions of the source domain and the target domain in order to mitigate the domain gap. After encoding, the features (domain-invariant representations) are so similar that we cannot distinguish which domain they come from.</li> <li>As the domain gap vanishes, a classifier is trained on the encoded features with the cross entropy loss, supervised by source labels. The classifier can then also generalize well on the target domain.</li> </ul> With different training procedures, we can further divide these methods into two categories: <ul> <li>Distance based methods</li> <li>Domain adversarial methods</li> </ul> Distance-based methods are closest to the traditional feature-based transfer methods. 
They also incorporate a maximum mean discrepancy (MMD) loss on the hidden representation of a particular layer or of several layers. It acts as a domain distance loss that matches the statistical moments of the distributions at different orders. Here we show an example of a classical method: DDC. Dotted lines mean the weights are shared. We can see that it has a shared seven-layer encoder with a one-layer classifier. Domain adversarial methods: Domain adversarial methods have an additional component called the domain classifier, and they also generate a domain label for each sample. The additional task of the domain classifier is to recognize which domain a sample comes from. The feature encoder fights against it! It tries to generate features that look like they come from the same distribution in order to confuse the domain classifier. As training moves on, we end up with a good classifier and also a good encoder that generates aligned features. Here we show an example of a classical method: DANN. Notice that the gradient coming out of the domain classifier is reversed, which means its sign is flipped. It is easy to understand why it is designed like this. Flipping the sign of the gradient is the same as flipping the sign of the loss: the domain classifier aims to become more discriminative, while the encoder, on the contrary, wants to generate more similar features, so this design is natural. Label transfer methods: The label transfer methods are to some extent connected with the instance-based methods, since they generate a pseudo label for each instance. These methods take advantage of the well-trained source model to generate pseudo labels for target samples based on the maximum posterior probability. The pseudo labels are then used to supervise the training of the target model. The key of these methods is to design rules that generate convincing labels (most of which are correct). The new pseudo labels can be used as true labels to further train on the target domain. 
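The pseudo-labeling step described above can be sketched in a few lines of NumPy. The snippet takes the class posteriors predicted by the source model on target samples, assigns the maximum-posterior class, and keeps only confident samples. The fixed confidence threshold is just one possible rule and is an assumption here, as are the function name `select_pseudo_labels` and its default value.

```python
import numpy as np


def select_pseudo_labels(probs, threshold=0.9):
    """Pick pseudo labels from source-model posteriors on target samples.

    probs: array of shape (n_target, n_classes), each row sums to 1.
    Returns (labels, keep): the maximum-posterior class per sample and a
    boolean mask marking samples whose confidence reaches the threshold
    (hypothetical selection rule).
    """
    labels = probs.argmax(axis=1)
    keep = probs.max(axis=1) >= threshold
    return labels, keep
```

Only the samples with `keep == True` would be used to supervise the next round of training on the target domain; the rest stay unlabeled until the model becomes more confident.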
This kind of method has some potential limitations: <ul> <li>The prior assumption of these methods is that the model trained on the source data already has a primary discriminative ability. In most cases, the source model can reach an accuracy of 60% without adaptation. It is fairly good but not good enough.</li> <li>If the rule is not designed well and generates many bad (incorrect) pseudo labels, it may indeed lead to the failure of the training. The bad pseudo labels give bad guidance, which generates even worse pseudo labels.</li> </ul> We take the asymmetric tri-training method as an example. The rule of this model is voting. The F1 and F2 classifiers generate the pseudo labels, and $F_t$ learns to classify well on the target domain. The voting procedure is as follows: <ul> <li>F1 and F2 are first trained on the source domain. A regularization term is added to ensure that the weights of F1 and F2 are orthogonal. This makes sure that they learn complementary knowledge.</li> <li>Then they predict on the target domain. The rule to generate the pseudo label is <ul> <li>Both F1 and F2 agree on the same prediction.</li> <li>One of them is really confident in the result, with at least 95% confidence.</li> </ul> </li> <li>Use the pseudo labels to further train the network.</li> <li>The model becomes more confident on the target domain, and the above steps are repeated.</li> </ul> Model-level transfer methods: Model-level transfer methods finetune the source model to construct a target network. This is also used as an auxiliary technique in many models. The weights indeed matter a lot, and the potential of model weights should be further studied. <h3 id="why-we-need-source-free-unsupervised-graph-domain-adaptation-the-key-challenge-and-our-solution">Why we need source free unsupervised graph domain adaptation? 
The key challenge and our solution.</h3> After so much effort introducing domain adaptation, it is now time to introduce the new scenario proposed in our paper. I think that without a deep understanding of the related work, it is hard to have a good idea. <h4 id="preliminary">Preliminary</h4> We first give some mathematical definitions of the graph and our task for a clear and concise exposition. A graph is defined as <code class="language-plaintext highlighter-rouge">$G=(V,E,X,Y)$</code>, where <code class="language-plaintext highlighter-rouge">$V=(v_1,\cdots, v_n)$</code> is the node set with $n$ nodes and $E$ is the edge set. $X$ is the node feature matrix and $Y$ contains the node labels. The model we use can be expressed as a conditional probability <code class="language-plaintext highlighter-rouge">$ \mathcal{Q} (Y|G ; \theta)$</code>. To better express the model for each node and to reflect the design of GNNs, we decompose the model as: <code class="language-plaintext highlighter-rouge">$\mathcal{Q}(Y|G;\theta)={ \prod_{v_i \in V}q(y_i|x_i,\mathcal{N}_i;\theta)}$</code> where <code class="language-plaintext highlighter-rouge">$\mathcal{N}_i$</code> is the neighborhood of node $i$ and $q$ is the conditional probability for each node. Unsupervised graph domain adaptation (UGDA) can then be expressed in a mathematical form. The inputs are a labeled source graph $G_S=(V_S,E_S,X_S,Y_S)$ and an unlabeled target graph $G_T=(V_T,E_T,X_T)$. The output is <code class="language-plaintext highlighter-rouge">$\mathcal{Q}(Y|G;\theta_T)$</code>, a model with good performance on the target domain. <h4 id="problems-and-challenge">Problems and challenge</h4> The key problem of the existing UGDA methods is that they heavily rely on the source graph data $G_S=(V_S,E_S,X_S,Y_S)$. 
During the adaptation, UGDA needs: <ul> <li>$X_S$ and $E_S$ to mitigate the domain gap with $X_T$ and $E_T$ and generate aligned features by a GNN encoder.</li> <li>$Y_S$ as the supervision signal to learn a discriminative classifier on the aligned features.</li> </ul> So if we cannot access the source graph during adaptation, none of the UGDA methods can work any more. The key challenges corresponding to the above bullets are: <ul> <li>How can the model adapt well to the shifted target data distribution without accessing the source graph $X_S$ and $E_S$ for aligning the feature distributions?</li> <li>How can we enhance the discriminative ability of the source model without accessing the source labels $Y_S$ for supervision?</li> </ul> The practical value: This scenario appears frequently in domain adaptation because of privacy concerns, which have become more and more important in recent years. For example, the source and target graphs may come from two different platforms, with some sensitive attributes in the source graph. Sharing the source graph may then lead to severe data leakage problems. To deal with the above issues, we propose a new scenario called Source Free Unsupervised Graph Domain Adaptation (SFUGDA), which does not need access to the source graph in the adaptation procedure. The entire training framework consists of the following two stages: <ul> <li>The inaccessible source training procedure <ul> <li>input: labeled source graph $G_S=(V_S,E_S,X_S,Y_S)$</li> <li>output: a well-trained source model: $\mathcal{Q}(Y|G;\theta_S)$</li> <li>Notice that this procedure is totally inaccessible in the SFUGDA scenario. We can determine neither the model architecture nor the optimizer. So it is hard to define what a well-trained model is. In the experiments of this paper, we consider the well-trained model to be the one with the best validation performance on the source domain.</li> </ul> </li> <li>The adaptation procedure. 
<ul> <li>input: unlabeled target graph <code class="language-plaintext highlighter-rouge">$G_T=(V_T,E_T,X_T)$</code> and the well-trained source model <code class="language-plaintext highlighter-rouge">$\mathcal{Q}(Y|G;\theta_S)$</code></li> <li>output: well-trained target model <code class="language-plaintext highlighter-rouge">$\mathcal{Q}(Y|G;\theta_T)$</code></li> <li>Notice that, as we cannot decide the model architecture, the algorithm in the adaptation procedure cannot rely on any design specific to a particular model component such as BatchNorm or WeightNorm. In other words, the algorithm should be model agnostic.</li> </ul> </li> </ul> <h4 id="solution">Solution</h4> To deal with the above challenges, we design a model-agnostic algorithm for the SFUGDA scenario called the SOurce free domain Graph Adaptation algorithm (SOGA). Nijigen ("2D")! SOGA has two components: <ul> <li>Structure Consistency (SC) optimization objective: adapt to the shifted target data distribution by leveraging the structure information $E_T$ rather than aligning with $X_S$ and $E_S$.</li> <li>Information Maximization (IM) optimization objective: further enhance the discriminative ability of the source model without access to $Y_S$. We theoretically prove that the IM loss can improve the lower bound of the target performance of the source model.</li> </ul> Then the whole training architecture is: We will introduce these two optimization objectives in detail next. 
<h5 id="information-maximzation-im-optimization-object">Information Maximization (IM) Optimization Objective</h5> The design of this objective is based on the following observations: <ul> <li> The well-trained source model has a primary discriminative ability </li> <li> The adaptation is unsupervised learning without any direct supervised loss </li> </ul> As we do not have a supervision signal, the goals should be: <ul> <li>the unsupervised loss should first keep the original performance of the source model and not make the result worse</li> <li>then, if possible, try to enhance the performance.</li> </ul> As a result, we design the IM optimization objective to improve the lower bound of the performance. The IM optimization objective can be written as: $\mathcal{L}_{IM}=MI(V_t,\hat{Y}_t)=-H(\hat{Y}_t|V_t) + H(\hat{Y}_t)$ where $\hat{Y}_t$ is the prediction on the target domain, and $\mathbf{V_t}$ is the information of the input nodes, containing the node features $\mathbf{X_t}$ and the information from the node neighbors $\mathbf{\mathcal{N}_t}$. $MI(\cdot, \cdot)$ is the mutual information, and $H(\cdot)$ and $H(\cdot | \cdot)$ are the entropy and the conditional entropy, respectively. So the objective can be divided into two parts: one minimizes the conditional entropy and the other maximizes the entropy of the marginal distribution of $\mathbf{\hat{Y}_t}$. Minimizing the conditional entropy enhances the confidence of the prediction (a one-hot prediction is optimal). This term can theoretically enhance the discriminative ability (see the lemmas in the paper if you like). The final implemented form of the conditional entropy is: $H(\mathbf{\hat{Y}_t} | \mathbf{X_t}) = \mathbb{E}_{x_i \sim p_t(x)} \left[- \sum_{y = 1}^k q(y|x_i, \mathbf{\mathcal{N}_i}; \Theta) \log q(y | x_i, \mathbf{\mathcal{N}_i}; \Theta)\right]$ Maximizing the entropy of the marginal distribution avoids concentrating all predictions on one category. 
$H(\mathbf{\hat{Y}_t}) = - \sum_y q(y) \log q(y), \\ \text{where}\; q(y) = \; \mathbb{E}_{v_i \sim p_t(v)} \left[q(y|x_i, \mathbf{\mathcal{N}_i}; \Theta) \right].$ <h5 id="strcuture-consistency-sc-optimization-object">Structure Consistency (SC) Optimization Objective</h5> This is also the key point of domain adaptation with graph properties. Here we want to compare our method with the source free unsupervised domain adaptation methods in CV, from the problem perspective to the experiment perspective. Difference between SFUGDA and source free methods in CV: From the problem perspective, the main difference is that the instances in an image dataset are i.i.d., whereas the samples in a graph dataset are naturally structured by dependencies. This dependence (structure) contains much information, for example the homophily property. This brings an additional challenge: even if two graphs have exactly the same feature distributions, they will still suffer from the domain gap caused by different structures. It is also an advantage of graph data. It is natural for graph data to support unsupervised learning such as graph embedding methods, which learn an informative embedding by utilizing only the graph structure. Though we cannot directly align the feature distributions and find a good absolute position, the structure somehow reveals the relative position between node pairs, which may help adaptation. From the experiment perspective, we also found some source free domain adaptation methods in the CV domain. We reimplemented them (using a GNN as the backbone instead of a CNN). However, none of them achieves satisfactory results. I think the main problems are: <ol> <li>Most of them belong to the label transfer methods, which use pseudo labels to guide the training.</li> <li>However, when they generate the labels, they do not take the graph structure, for example the homophily, much into account.</li> <li>We test the label smoothness (how homophilous the labels are). 
The smoothness of the pseudo labels is much smaller than that of the ground truth.</li> <li>The wrong pseudo labels lead the training in the wrong direction, and the results become worse and worse in the end.</li> </ol> As a result, we think it is quite necessary to introduce the graph consistency constraint. SC design: SC is based on two hypotheses on the graph structure: <ol> <li>The probability of sharing the same label for local neighbors is relatively high.</li> <li>The probability of sharing the same label for nodes with the same structural role is relatively high.</li> </ol> So the intuition of SC is that if two nodes are local neighbors or have the same structural role, their predicted vectors $\hat{y}$ should be similar. SC promotes structure consistency by enlarging the similarity between connected nodes and distinguishing nodes without a connection. The question is then how to define local neighbors and nodes with the same structural role. <ul> <li>For local neighbor nodes: these are just the nodes with a direct edge connection to the node. The node pair is defined as $(v_i, v_j)\in E_t$</li> <li>For nodes with the same structural role: we follow a construction similar to struc2vec. Two nodes are similar if they have similar degrees, and more similar if their neighbor nodes also have the same degrees. For more details, please refer to the original paper. <a href="https://arxiv.org/pdf/1704.03165.pdf">link</a> The node pair is defined as $(v_i, v_j)\in S_t$. 
$S_t$ is the graph structure constructed from the structural role information.</li> </ul> The mathematical formulation of the SC loss is: <code class="language-plaintext highlighter-rouge">$$ \mathcal{L}_{SC} = \lambda_1 \sum_{(v_i, v_j) \in \mathbf{E_t}} \log J_{ij} - \epsilon \cdot \mathbb{E}_{v_n \sim p_{n}} [\log J_{in} ] + \lambda_2 \sum_{(v_i, v_j) \in \mathbf{\mathcal{S}_t}} \log J_{ij} - \epsilon \cdot \mathbb{E}_{v_n \sim p_{n}'} \left[\log J_{in}\right] $$</code> where <code class="language-plaintext highlighter-rouge">$J_{ij} = \sigma(\left\langle\!\mathbf{\hat{y}_t^{(i)}}, \mathbf{\hat{y}_t^{(j)}}\!\right\rangle)$</code>, <code class="language-plaintext highlighter-rouge">$p_{n}$ and $p_{n}'$</code> are the distributions for negative samples, and $\epsilon$ is the number of negative samples. We use uniform distributions for <code class="language-plaintext highlighter-rouge">$p_{n}$</code> and <code class="language-plaintext highlighter-rouge">$p_{n}'$</code>, while they can be adjusted if needed. $\epsilon$ is set to 5 in our experiments. In all our experiments except for the hyperparameter sensitivity analysis, <code class="language-plaintext highlighter-rouge">$\lambda_1$</code> and <code class="language-plaintext highlighter-rouge">$\lambda_2$</code> are set to the default value 1.0. <h3 id="experiments">Experiments</h3> Firstly, I want to clarify the fairness of our experiments. I believe that our experiments are a really fair comparison. <ul> <li>We ensure that all reported test results are obtained after model selection on the validation set.</li> <li>Though there are some hyperparameters in our method SOGA, we set all of them to the default values without any hyperparameter tuning.</li> <li>For baseline methods with tunable hyperparameters, we carefully run a grid search over a large range. 
We show the range of each hyperparameter and the best hyperparameters on each dataset in the appendix for the reproducibility of our experiments.</li> <li>Each result is run with five random seeds for a fair comparison.</li> </ul> We aim to answer four research questions in this section. I will try to quickly go through them. For the detailed results, check our paper. Then I will turn to some remaining problems in the experiments that still need to be understood. <h6 id="rq1-how-does-the-gcn-soga-compare-with-other-state-of-the-art-node-classification-methods-gcn-soga-indicates-soga-applying-on-the-default-source-domain-model-gcn">RQ1: How does GCN-SOGA compare with other state-of-the-art node classification methods? (GCN-SOGA denotes SOGA applied to the default source domain model, GCN.)</h6> GCN-SOGA achieves even better results than other UGDA methods that have access to the source data. <h6 id="rq2-can-soga-still-achieve-satisfactory-results-when-being-applied-to-different-source-domain-gnn-models-other-than-gcn">RQ2: Can SOGA still achieve satisfactory results when being applied to different source domain GNN models other than GCN?</h6> Though some models, such as GAT and GraphSAGE, show very bad performance without adaptation, all of them show comparable results after applying SOGA for adaptation, which indicates the model-agnostic property. <h6 id="rq3-how-do-different-components-in-soga-contribute-to-its-effectiveness">RQ3: How do different components in SOGA contribute to its effectiveness?</h6> <ul> <li>The weights of the source model are the most important part. With only the unsupervised losses, the model can learn nothing at all.</li> <li>The IM loss can enhance the model performance. 
However, this improvement is not stable during the adaptation procedure.</li> <li>The SC loss can preserve consistency and make the adaptation performance more stable.</li> </ul> <h6 id="rq4-how-do-different-choices-of-hyperparameters-λ1-and-λ2-affect-the-performance-of-soga">RQ4: How do different choices of hyperparameters λ1 and λ2 affect the performance of SOGA?</h6> The performance is robust to these hyperparameters and changes little across their values. <h4 id="remain-question-in-experiment">Remaining questions in the experiments</h4> Here we discuss some open questions that are not yet well understood. We do not have precise answers. <h6 id="why-soga-can-achieve-better-performance-then-other-ugda-method-with-access-to-the-source-data">Why can SOGA achieve better performance than other UGDA methods with access to the source data?</h6> I think it is mainly because UGDA methods are far from their upper bound. The two losses in UGDA (the cross-entropy loss and the domain alignment loss) somewhat conflict. Admittedly, a domain alignment loss such as the domain adversarial loss can make the domains look similar. However, as mentioned for the traditional methods, what we really want is: <ol> <li>The distance between the marginal distributions $P(\phi(X_S))$ and $P(\phi(X_T))$ is small.</li> <li>$\phi(X_S)$ and $\phi(X_T)$ preserve important properties of $X_S$ and $X_T$.</li> </ol> Because the domain alignment loss may also discard important properties of $X_S$ and $X_T$, it may reduce the discriminative ability, and as a result the cross-entropy loss becomes larger. This may lead to a fluctuating optimization procedure that gets stuck in a bad local minimum. As shown in the accompanying figure, the Macro-F1 score of UDAGCN fluctuates heavily, which may support our guess. Source-free methods do not have this limitation, which may be the reason for the better performance. 
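To make the conflict argument above concrete, here is a minimal Python sketch. It is only a toy under my own assumptions: the alignment term is a simple squared distance between mean embeddings (a stand-in, not the adversarial loss used by UDAGCN), the embeddings are used directly as logits, the data is synthetic, and all names such as <code class="language-plaintext highlighter-rouge">mean_embedding_distance</code> are hypothetical. It shows that an encoder which collapses every node to one point satisfies the first requirement perfectly while violating the second, so the cross-entropy loss rises to that of a uniform guess.

```python
import numpy as np


def mean_embedding_distance(z_src, z_tgt):
    """Squared distance between mean source and mean target embeddings.

    Assumption: a simple stand-in for a domain alignment loss,
    not the adversarial loss used by UDAGCN.
    """
    diff = z_src.mean(axis=0) - z_tgt.mean(axis=0)
    return float(diff @ diff)


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy, computed in a numerically stable way."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


rng = np.random.default_rng(0)
num_classes, num_nodes = 3, 200
one_hot = np.eye(num_classes)

# Synthetic labels for a toy source and target domain (hypothetical data).
labels_src = rng.integers(0, num_classes, size=num_nodes)
labels_tgt = rng.integers(0, num_classes, size=num_nodes)

# Informative encoder: embeddings carry the class signal,
# and the target domain is shifted by a constant offset.
z_src = 4.0 * one_hot[labels_src] + rng.normal(0.0, 0.1, (num_nodes, num_classes))
z_tgt = 4.0 * one_hot[labels_tgt] + 1.0 + rng.normal(0.0, 0.1, (num_nodes, num_classes))

# Collapsed encoder: every node is mapped to the same point.
z_src_collapsed = np.zeros_like(z_src)
z_tgt_collapsed = np.zeros_like(z_tgt)

align_informative = mean_embedding_distance(z_src, z_tgt)
align_collapsed = mean_embedding_distance(z_src_collapsed, z_tgt_collapsed)

# Embeddings are used directly as logits (identity classifier, an assumption).
ce_informative = cross_entropy(z_src, labels_src)
ce_collapsed = cross_entropy(z_src_collapsed, labels_src)

print("alignment distance:", align_informative, "vs collapsed:", align_collapsed)
print("source cross-entropy:", ce_informative, "vs collapsed:", ce_collapsed)
```

In this toy, the collapsed encoder wins on alignment and loses on classification, which is the kind of tension I suspect behind the fluctuating optimization. It is an illustration of the argument only, not evidence about any real UGDA model.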
<h6 id="how-can-the-model-adapt-on-the-target-domain-without-explicit-adaptation-components">How can the model adapt to the target domain without explicit adaptation components?</h6> First, we cannot build an explicit adaptation component because we have no access to the source data, so alignment is impossible. Second, even in the source-free scenario, existing methods design specific model components, for example batch norm or weight norm, so that the model memorizes some information from the source domain for alignment and adaptation. However, such specific designs are not practical in the real world: adding these components means retraining a new model on the source domain, so the existing well-trained source model cannot be used directly. Our algorithm may not be so fancy, but it respects many real-world constraints, which limits how fancy the model can be and makes it more practical. Back to the question: the connection between the source and the target is built by using the source model as the initialization of the target model. Since we cannot access the source data, we can only leverage the information stored in the source model. Lemma 2 also shows the importance of the primary discriminative ability of the source model. As I mentioned in section 3, the model weights are important, but what they actually contribute is still under discussion, much like pretrained weights. <h6 id="why-graphsage-and-gat-do-not-performance-well-in-some-case-when-directly-apply-on-the-target-domain-while-gcn-can-always-have-a-good-initial-performance">Why do GraphSAGE and GAT not perform well in some cases when directly applied to the target domain, while GCN always has good initial performance?</h6> This is a really confusing question: we have tuned the hyperparameters, but no good result was achieved. I am still wondering why this phenomenon occurs. 
If you have any ideas, feel free to discuss them with me. <h3 id="conclusion-and-future-work">Conclusion and future work</h3> In this work, we articulate a new scenario called Source Free Unsupervised Graph Domain Adaptation (SFUGDA), in which the source graph is inaccessible for practical reasons such as privacy policies. Most existing methods do not work well in this scenario because feature alignment is no longer possible. To address the challenges in SFUGDA, we propose our algorithm SOGA, which can be applied to arbitrary GNN models by adapting to the shifted target domain distribution and enhancing the discriminative ability of the source model. Extensive experiments demonstrate the effectiveness of SOGA from multiple perspectives. As for future work, I think there is a lot to do, as we are the first to open up this scenario! <ul> <li>How to take further advantage of the unsupervised graph information is a great topic. I believe we have not yet found the optimal way to use the structure information.</li> <li>Some deeper understanding is still missing, as the section above mentioned. We have only verified the correctness of the results and do not know exactly why these things happen.</li> </ul> This is quite a new scenario, and I am looking forward to your follow-up work! If you have any questions, please feel free to contact me. I will be more than glad to help. </article> </main></body></html>