Simple Text Multi Classification Task Using Keras BERT – also explains some theory about BERT along with the code. An important limitation to be aware of is that BERT's maximum sequence length is 512 tokens: inputs shorter than the maximum must be padded with [PAD] tokens, and longer sequences must be truncated. Keep this limit in mind when working with longer text segments.
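To make the padding/truncation point concrete, here is a minimal sketch using the Hugging Face `transformers` tokenizer (an assumption on my part; the linked article uses Keras BERT, but the handling of the 512-token limit is the same idea):

```python
# Minimal sketch of handling BERT's 512-token limit.
# Assumes the Hugging Face `transformers` library, not the Keras BERT
# package from the linked article.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

texts = ["a short sentence", "a much longer document that may exceed the limit"]

# padding="max_length" appends [PAD] tokens up to max_length;
# truncation=True cuts sequences longer than max_length.
encoded = tokenizer(
    texts,
    padding="max_length",
    truncation=True,
    max_length=512,
    return_tensors="tf",
)
print(encoded["input_ids"].shape)  # (2, 512)
```

Every example in the batch comes out at exactly 512 tokens, which is what the fixed positional embeddings of BERT require.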
Transformer model for language understanding – Colab code. Implements the transformer in TensorFlow, following the original paper. I tried this extensively; I ran into problems with masking, and the model also seems to be constrained to sequences of 512 tokens.
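Since masking was the sticking point, here is a short sketch of the two masks that tutorial-style TensorFlow transformers typically build, assuming token id 0 is the pad token (this mirrors the tutorial's approach but is my own reconstruction, not the linked notebook's exact code):

```python
import tensorflow as tf

def create_padding_mask(seq):
    # 1.0 where the token is padding (id 0), 0.0 elsewhere.
    # Extra axes broadcast over the attention heads and query positions.
    mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
    return mask[:, tf.newaxis, tf.newaxis, :]  # (batch, 1, 1, seq_len)

def create_look_ahead_mask(size):
    # Upper-triangular mask so position i cannot attend to positions > i,
    # used in the decoder's self-attention.
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

# Example: mask out the [PAD] positions of a toy batch.
print(create_padding_mask(tf.constant([[7, 6, 0, 0]])))
print(create_look_ahead_mask(4))
```

Getting the broadcast shapes wrong here is a common source of the masking problems mentioned above, since the mask must align with the `(batch, heads, seq_len, seq_len)` attention logits.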
Note (ImageNet dataset fact): the dataset spans 1,000 object classes and contains 1,281,167 training images, 50,000 validation images, and 100,000 test images.