Detail of Publication
| Text Language | English |
| --- | --- |
| Authors | Rina Buoy, Masakazu Iwamura, Sovila Srun, Koichi Kise |
| Title | ViTSTR-Transducer: Cross-Attention-Free Vision Transformer Transducer for Scene Text Recognition |
| Journal | Journal of Imaging |
| Vol. | 9 |
| No. | 12 |
| Number of Pages | 17 pages |
| Publisher | MDPI |
| Reviewed or not | Reviewed |
| Month & Year | December 2023 |
| Abstract | Attention-based encoder–decoder scene text recognition (STR) architectures have been proven effective in recognizing text in the real world, thanks to their ability to learn an internal language model. Nevertheless, the cross-attention operation that is used to align visual and linguistic features during decoding is computationally expensive, especially in low-resource environments. To address this bottleneck, we propose a cross-attention-free STR framework that still learns a language model. The framework we propose is ViTSTR-Transducer, which draws inspiration from ViTSTR, a vision transformer (ViT)-based method designed for STR, and the recurrent neural network transducer (RNN-T), initially introduced for speech recognition. The experimental results show that our ViTSTR-Transducer models outperform the baseline attention-based models in terms of the required decoding floating point operations (FLOPs) and latency while achieving a comparable level of recognition accuracy. Compared with the baseline context-free ViTSTR models, our proposed models achieve superior recognition accuracy. Furthermore, compared with the recent state-of-the-art (SOTA) methods, our proposed models deliver competitive results. |
| DOI | 10.3390/jimaging9120276 |
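
To give a feel for the cross-attention-free decoding idea the abstract describes, here is a minimal, hypothetical sketch of an RNN-T-style joiner that fuses one visual token with a language-model (prediction network) state through simple additive projections rather than attention over all visual tokens. This is an illustration in PyTorch only; the class and all names (`Joiner`, `vis_dim`, `lm_dim`, etc.) are assumptions for the example, not the paper's actual code or exact architecture.

```python
# Illustrative sketch (not the authors' implementation) of transducer-style
# decoding without cross-attention: each step fuses a single visual feature
# with a language-model state via two linear projections and a softmax head.
import torch
import torch.nn as nn

class Joiner(nn.Module):
    def __init__(self, vis_dim: int, lm_dim: int, hidden: int, vocab: int):
        super().__init__()
        self.proj_vis = nn.Linear(vis_dim, hidden)  # project visual token
        self.proj_lm = nn.Linear(lm_dim, hidden)    # project LM state
        self.out = nn.Linear(hidden, vocab)         # character logits

    def forward(self, vis_t: torch.Tensor, lm_u: torch.Tensor) -> torch.Tensor:
        # Additive fusion: cost is linear in the number of decoding steps,
        # unlike cross-attention, which scores every visual token per step.
        return self.out(torch.tanh(self.proj_vis(vis_t) + self.proj_lm(lm_u)))

# One decoding step with made-up dimensions (all values hypothetical):
joiner = Joiner(vis_dim=384, lm_dim=256, hidden=256, vocab=97)
vis_t = torch.randn(1, 384)   # t-th visual (ViT patch) feature
lm_u = torch.randn(1, 256)    # current prediction-network state
logits = joiner(vis_t, lm_u)  # shape (1, 97): per-character scores
```

The point of the sketch is the cost argument from the abstract: the per-step work here is a fixed number of matrix-vector products, whereas cross-attention must compare the decoder state against every encoder output at every step.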
- Entry for BibTeX

```bibtex
@Article{Buoy2023,
  author    = {Rina Buoy and Masakazu Iwamura and Sovila Srun and Koichi Kise},
  title     = {ViTSTR-Transducer: Cross-Attention-Free Vision Transformer Transducer for Scene Text Recognition},
  journal   = {Journal of Imaging},
  year      = 2023,
  month     = dec,
  volume    = {9},
  number    = {12},
  numpages  = {17},
  doi       = {10.3390/jimaging9120276},
  publisher = {MDPI}
}
```