This blog revisits the underlying principles of the Vision Transformer (ViT) and explores how the Transformer architecture can be extended to other modalities. In tasks such as "Weight2Weight," for example, prior approaches have often simply flattened model weights into one-dimensional tensors, without leveraging positional encodings to preserve any of the weights' structural layout.
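As context for the discussion, the standard ViT recipe tokenizes an input by splitting it into patches, linearly projecting each patch, and adding positional embeddings — the ingredient that the flattening approaches above omit. The sketch below is a minimal NumPy illustration of that pipeline; the function name `patchify` and the random initializations are illustrative, not from any particular implementation.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    p = patch_size
    patches = image.reshape(H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)
    return patches  # shape: (num_patches, patch_dim)

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))
patches = patchify(image, patch_size=8)  # (16, 192)

# Linear patch projection plus learned positional embeddings
# (both randomly initialized here purely for illustration).
d_model = 64
W_proj = rng.standard_normal((patches.shape[1], d_model)) * 0.02
pos_embed = rng.standard_normal((patches.shape[0], d_model)) * 0.02

tokens = patches @ W_proj + pos_embed  # (16, 64) token sequence
print(tokens.shape)
```

The key point for other modalities is the last line: without `pos_embed`, the token sequence is permutation-invariant, so the model has no way to recover where each patch (or weight slice) came from.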