Graphormer
Overview
The Graphormer model was proposed in Do Transformers Really Perform Bad for Graph Representation? by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen and Tie-Yan Liu. It is a Graph Transformer model, modified to allow computations on graphs instead of text sequences: embeddings and features of interest are generated during preprocessing and collation, and the model then applies a modified attention.
The abstract from the paper is the following:
The Transformer architecture has become a dominant choice in many domains, such as natural language processing and computer vision. Yet, it has not achieved competitive performance on popular leaderboards of graph-level prediction compared to mainstream GNN variants. Therefore, it remains a mystery how Transformers could perform well for graph representation learning. In this paper, we solve this mystery by presenting Graphormer, which is built upon the standard Transformer architecture, and could attain excellent results on a broad range of graph representation learning tasks, especially on the recent OGB Large-Scale Challenge. Our key insight to utilizing Transformer in the graph is the necessity of effectively encoding the structural information of a graph into the model. To this end, we propose several simple yet effective structural encoding methods to help Graphormer better model graph-structured data. Besides, we mathematically characterize the expressive power of Graphormer and exhibit that with our ways of encoding the structural information of graphs, many popular GNN variants could be covered as the special cases of Graphormer.
Tips:
This model will not work well on large graphs (more than 100 nodes/edges), as memory usage will explode.
You can reduce the batch size, increase your RAM, or decrease the UNREACHABLE_NODE_DISTANCE parameter in algos_graphormer.pyx, but it will be hard to go above 700 nodes/edges.
This model does not use a tokenizer, but instead a special collator during training.
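To get a feel for why memory grows so quickly with graph size, note that several per-graph inputs hold one entry per pair of nodes. The sketch below is a rough back-of-the-envelope estimate, not the library's own accounting. The helper name `pairwise_input_bytes`, the assumed tensor shapes (n x n for the attention bias and spatial positions, n x n x multi_hop_max_dist for the multi-hop edge input) and the 8-byte element size are all assumptions made for illustration.

```python
def pairwise_input_bytes(num_nodes, multi_hop_max_dist=5, bytes_per_value=8):
    """Rough size in bytes of the pairwise inputs for one graph (assumed shapes)."""
    pairs = num_nodes * num_nodes
    attn_bias = pairs                              # one bias value per node pair
    spatial_pos = pairs                            # one hop distance per node pair
    multi_hop_edges = pairs * multi_hop_max_dist   # one edge id per hop, per node pair
    return (attn_bias + spatial_pos + multi_hop_edges) * bytes_per_value


for n in (100, 700):
    print(n, "nodes:", round(pairwise_input_bytes(n) / 1e6, 2), "MB per graph")
```

Because the count is quadratic in the number of nodes, going from 100 to 700 nodes multiplies this estimate by 49. The estimate covers only the inputs; attention activations inside the model also scale with the number of node pairs.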
This model was contributed by clefourrier. The original code can be found here.
GraphormerConfig
class transformers.GraphormerConfig
( num_classes: int = 2, num_atoms: int = 4608, num_edges: int = 1536, num_in_degree: int = 512, num_out_degree: int = 512, num_spatial: int = 512, num_edge_dis: int = 128, multi_hop_max_dist: int = 5, spatial_pos_max: int = 1024, edge_type: str = 'multi_hop', max_nodes: int = 512, share_input_output_embed: bool = False, num_hidden_layers: int = 12, embedding_dim: int = 768, ffn_embedding_dim: int = 768, num_attention_heads: int = 32, dropout: float = 0.1, attention_dropout: float = 0.1, activation_dropout: float = 0.1, layerdrop: float = 0.0, encoder_normalize_before: bool = False, pre_layernorm: bool = False, apply_graphormer_init: bool = False, activation_fn: str = 'gelu', embed_scale: float = None, freeze_embeddings: bool = False, num_trans_layers_to_freeze: int = 0, traceable: bool = False, q_noise: float = 0.0, qn_block_size: int = 8, kdim: int = None, vdim: int = None, bias: bool = True, self_attention: bool = True, pad_token_id = 0, bos_token_id = 1, eos_token_id = 2, **kwargs )
Parameters
- num_classes (int, optional, defaults to 2) — Number of target classes or labels; set to 1 if the task is a regression task.
- num_atoms (int, optional, defaults to 512*9) — Number of node types in the graphs.
- num_edges (int, optional, defaults to 512*3) — Number of edge types in the graphs.
- num_in_degree (int, optional, defaults to 512) — Number of in-degree types in the input graphs.
- num_out_degree (int, optional, defaults to 512) — Number of out-degree types in the input graphs.
- num_edge_dis (int, optional, defaults to 128) — Number of edge distances in the input graphs.
- multi_hop_max_dist (int, optional, defaults to 5) — Maximum distance of multi-hop edges between two nodes.
- spatial_pos_max (int, optional, defaults to 1024) — Maximum distance between nodes in the graph attention bias matrices, used during preprocessing and collation.
- edge_type (str, optional, defaults to "multi_hop") — Type of edge relation chosen.
- max_nodes (int, optional, defaults to 512) — Maximum number of nodes which can be parsed for the input graphs.
- share_input_output_embed (bool, optional, defaults to False) — Shares the embedding layer between encoder and decoder. Careful: True is not implemented.
- num_hidden_layers (int, optional, defaults to 12) — Number of layers.
- embedding_dim (int, optional, defaults to 768) — Dimension of the embedding layer in encoder.
- ffn_embedding_dim (int, optional, defaults to 768) — Dimension of the “intermediate” (often named feed-forward) layer in encoder.
- num_attention_heads (int, optional, defaults to 32) — Number of attention heads in the encoder.
- self_attention (bool, optional, defaults to True) — Whether the model is self-attentive (False is not implemented).
- activation_fn (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "silu" and "gelu_new" are supported.
- dropout (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
- attention_dropout (float, optional, defaults to 0.1) — The dropout probability for the attention weights.
- activation_dropout (float, optional, defaults to 0.1) — The dropout probability after activation in the FFN.
- layerdrop (float, optional, defaults to 0.0) — The LayerDrop probability for the encoder. See the LayerDrop paper (https://arxiv.org/abs/1909.11556) for more details.
- bias (bool, optional, defaults to True) — Uses bias in the attention module; unsupported at the moment.
- embed_scale (float, optional, defaults to None) — Scaling factor for the node embeddings.
- num_trans_layers_to_freeze (int, optional, defaults to 0) — Number of transformer layers to freeze.
- encoder_normalize_before (bool, optional, defaults to False) — Normalize features before encoding the graph (apply the layer norm before each encoder block).
- pre_layernorm (bool, optional, defaults to False) — Apply layernorm before self attention and the feed forward network. Without this, post layernorm will be used.
- apply_graphormer_init (bool, optional, defaults to False) — Apply a custom graphormer initialisation to the model before training.
- freeze_embeddings (bool, optional, defaults to False) — Freeze the embedding layer, or train it along with the model.
- q_noise (float, optional, defaults to 0.0) — Amount of quantization noise (see “Training with Quantization Noise for Extreme Model Compression”). For more detail, see fairseq’s documentation on quant_noise.
- qn_block_size (int, optional, defaults to 8) — Size of the blocks for subsequent quantization with iPQ (see q_noise).
- kdim (int, optional, defaults to None) — Dimension of the key in the attention, if different from the other values.
- vdim (int, optional, defaults to None) — Dimension of the value in the attention, if different from the other values.
- use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models).
- traceable (bool, optional, defaults to False) — Changes return value of the encoder’s inner_state to stacked tensors.
This is the configuration class to store the configuration of a GraphormerModel. It is used to instantiate a Graphormer model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the Graphormer graphormer-base-pcqm4mv1 architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
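One practical constraint to keep in mind when changing the defaults: multi-head attention splits the embedding evenly across heads, so embedding_dim should be divisible by num_attention_heads. The following is a minimal sketch of that check using the default values listed above. The helper name `head_dim` is hypothetical and not part of the library.

```python
def head_dim(embedding_dim=768, num_attention_heads=32):
    """Return the per-head dimension, or raise if the split is uneven."""
    if embedding_dim % num_attention_heads != 0:
        raise ValueError(
            f"embedding_dim ({embedding_dim}) must be divisible by "
            f"num_attention_heads ({num_attention_heads})"
        )
    return embedding_dim // num_attention_heads


print(head_dim())         # defaults: 768 split across 32 heads
print(head_dim(512, 8))   # a smaller, hypothetical configuration
```

Running a check like this before building a custom configuration catches an invalid combination early, before any model weights are allocated.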
GraphormerModel
The Graphormer model is a graph-encoder model.
It goes from a graph to its representation. If you want to use the model for a downstream classification task, use GraphormerForGraphClassification instead. For any other downstream task, feel free to add a new class, or combine this model with a downstream model of your choice, following the example in GraphormerForGraphClassification.
forward
( input_nodes, input_edges, attn_bias, in_degree, out_degree, spatial_pos, attn_edge_type, perturb = None, masked_tokens = None, return_dict: typing.Optional[bool] = True, **unused )
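Several of these inputs (spatial_pos, in_degree, out_degree) describe graph structure and are produced during preprocessing and collation rather than by a tokenizer. As a rough illustration of what such features contain, the sketch below computes all-pairs hop distances and node degrees for a small undirected graph in plain Python. It is not the library's preprocessing code: the function name `structural_features`, the sentinel value used for unreachable pairs, and the absence of any padding offsets are assumptions made for this example.

```python
from collections import deque

UNREACHABLE = 510  # assumed sentinel for disconnected node pairs (illustrative value)


def structural_features(num_nodes, edges):
    """Compute hop distances and degrees for an undirected graph.

    edges is a list of (u, v) pairs with 0-based node ids.
    """
    neighbors = [[] for _ in range(num_nodes)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    # Breadth-first search from every node gives all-pairs hop distances.
    spatial_pos = [[UNREACHABLE] * num_nodes for _ in range(num_nodes)]
    for source in range(num_nodes):
        spatial_pos[source][source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if spatial_pos[source][nxt] == UNREACHABLE:
                    spatial_pos[source][nxt] = spatial_pos[source][node] + 1
                    queue.append(nxt)

    # In an undirected graph the in-degree and out-degree coincide.
    degree = [len(adjacent) for adjacent in neighbors]
    return spatial_pos, degree


# Path graph 0-1-2 plus an isolated node 3.
spatial_pos, degree = structural_features(4, [(0, 1), (1, 2)])
print(spatial_pos[0])
print(degree)
```

The distance matrix has one entry per node pair, which is one reason the inputs grow quadratically with graph size.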
GraphormerForGraphClassification
This model can be used for graph-level classification or regression tasks.
It can be trained on:
- regression (by setting config.num_classes to 1); there should be one float-type label per graph
- single-task classification (by setting config.num_classes to the number of classes); there should be one integer label per graph
- binary multi-task classification (by setting config.num_classes to the number of labels); there should be a list of integer labels for each graph.
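The three label formats above can be summarized with a small validation helper. This is only a sketch of the conventions described in the list; the function name `describe_task` is hypothetical and not part of the library, and it inspects the label of a single graph.

```python
def describe_task(num_classes, label):
    """Name the training mode implied by num_classes and one graph's label."""
    if num_classes == 1:
        if not isinstance(label, float):
            raise TypeError("regression expects one float label per graph")
        return "regression"
    if isinstance(label, int):
        if not 0 <= label < num_classes:
            raise ValueError("class index out of range")
        return "single-task classification"
    if len(label) != num_classes:
        raise ValueError("expected one binary label per task")
    return "binary multi-task classification"


print(describe_task(1, 0.37))        # one float per graph
print(describe_task(3, 2))           # one class index per graph
print(describe_task(3, [1, 0, 1]))   # one 0/1 label per task
```

Checking labels this way before training helps catch a mismatch between config.num_classes and the dataset, which would otherwise surface as a shape or dtype error inside the loss computation.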
forward
( input_nodes, input_edges, attn_bias, in_degree, out_degree, spatial_pos, attn_edge_type, labels: typing.Optional[torch.LongTensor] = None, return_dict: typing.Optional[bool] = True, **unused )