
Announcing Japanese InstructBLIP Alpha

Stability AI has released its first Japanese vision-language model, Japanese InstructBLIP Alpha. The model can generate textual descriptions of input images and answer questions about them.

Japanese InstructBLIP Alpha

Japanese InstructBLIP Alpha is a vision-language model that generates text conditioned on input images. It is built on the recently released Japanese large language model Japanese StableLM Instruct Alpha 7B.

Figure 1. Output: “Two persons sitting on a bench looking at Mt.Fuji”

Japanese InstructBLIP Alpha leverages the InstructBLIP architecture, which has shown remarkable performance across a variety of vision-language datasets. To build a high-performance model from a limited amount of Japanese data, part of the model was initialized with a pre-trained InstructBLIP checkpoint trained on large English datasets, and the resulting model was then fine-tuned on the limited Japanese dataset.
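The general recipe described above, initializing from a pre-trained checkpoint and fine-tuning only part of the network on scarce target-language data, can be sketched with a small parameter-freezing helper. This is an illustrative sketch only, not Stability AI's actual training code; the toy module names (`vision_encoder`, `qformer`, `language_model`) merely mirror the InstructBLIP layout.

```python
import torch
from torch import nn

def freeze_except(model: nn.Module, trainable_prefixes: tuple) -> int:
    """Freeze all parameters except those whose names start with one of
    the given prefixes. Returns the count of trainable parameters left."""
    trainable = 0
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable_prefixes)
        if param.requires_grad:
            trainable += param.numel()
    return trainable

# Toy stand-in for the InstructBLIP layout: a frozen vision encoder,
# a trainable Q-Former-style bridge, and a frozen language model.
# (Shapes are arbitrary; this only demonstrates the freezing logic.)
toy = nn.ModuleDict({
    "vision_encoder": nn.Linear(16, 8),
    "qformer": nn.Linear(8, 8),
    "language_model": nn.Linear(8, 4),
})

n_trainable = freeze_except(toy, ("qformer",))
print(n_trainable)  # only the Q-Former-style bridge remains trainable
```

Fine-tuning then proceeds as usual, with the optimizer only ever receiving the parameters that still have `requires_grad=True`.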

Potential applications of this model include image-based search, scene description and question answering, and textual descriptions of images for visually impaired users.

Performance

In addition to generating Japanese text, Japanese InstructBLIP Alpha can accurately recognize Japan-specific subjects such as Tokyo Skytree or Kinkaku-ji.

Figure 2. Output: “Sakura and Tokyo Skytree”

Figure 3. Output: “Kinkakuji temple in Kyoto”

Furthermore, the input can also include text, such as a question. As the examples below show, the model can answer questions about input images.

Figure 4. Prompt: “What are the speed limits written on the road?”, Output: “30km/h”

Figure 5. Prompt: “Which is the larger one?”, Output: “Left one”

Figure 6. Prompt: “What color is the yukata of the person on the right?”, Output: “purple”

Terms of Use

The model is available on the Hugging Face Hub for both inference and additional training. For more information, please visit the Hugging Face Hub pages linked below:

The Japanese InstructBLIP Alpha was developed for research purposes and is available exclusively for research use. For more details, please refer to the Hugging Face Hub page.
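For readers who want to try the checkpoint, the sketch below shows one plausible way to load it from the Hugging Face Hub with the `transformers` library. The repository id, processor class, and Japanese prompt template are assumptions based on typical InstructBLIP-style usage; the model card on the Hugging Face Hub is the authoritative reference, and the license restricts use to research.

```python
MODEL_ID = "stabilityai/japanese-instructblip-alpha"  # assumed repo id

def build_prompt(question: str = "") -> str:
    """Compose an instruction-style Japanese prompt (assumed template).
    With no question, ask for a detailed description of the image."""
    instruction = question or "与えられた画像について、詳細に述べてください。"
    return f"### 指示:\n{instruction}\n\n### 応答:\n"

def generate_caption(image_path: str, question: str = "") -> str:
    # Heavy dependencies are imported lazily so the prompt helper above
    # stays usable without transformers/PIL installed.
    import torch
    from PIL import Image
    from transformers import AutoModelForVision2Seq, AutoProcessor

    # Downloads ~7B parameters on first call; research use only per the license.
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForVision2Seq.from_pretrained(
        MODEL_ID, trust_remote_code=True, torch_dtype=torch.float16
    )
    inputs = processor(images=Image.open(image_path),
                       text=build_prompt(question), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]

if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:  # only hit the network when an image path is given
        print(generate_caption(sys.argv[1]))
    else:
        print(build_prompt("道路に書かれている速度制限は何ですか？"))
```

Passing a question reproduces the visual question answering shown in Figures 4 through 6, while the default prompt asks for a caption as in Figure 1.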

About Stability AI

Stability AI is an open-access generative AI company working with partners to deliver next-generation infrastructure globally. Headquartered in London with developers around the world, Stability AI’s open philosophy provides new avenues for cutting-edge research in imaging, language, code, audio, video, 3D content, design, biotechnology, and other scientific research. For more information, visit https://stability.ai/
