---
license: apache-2.0
pipeline_tag: image-to-text
---
|
|
|
# Moonline |
|
|
|
Moonline is a fork of [moondream2](https://huggingface.co./vikhyatk/moondream2). It combines moondream2's image-to-text generation with a modified version of [outlines](https://github.com/outlines-dev/outlines), so that text is generated according to a specific pydantic model.
|
|
|
## Model Details |
|
|
|
The weights and the model structure come directly from moondream2. The difference is that the Phi text model is swapped for a Phi model that generates text according to a given structure. Since the outlines API doesn't work directly on embeddings, only the relevant parts are copied and modified.
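For context, this is roughly how plain outlines constrains a text-only transformer to a pydantic schema (a minimal sketch against the outlines 0.0.x-style API; the model name and prompt are placeholders, and Moonline reimplements these internals so that generation can start from image embeddings instead):

```python
import outlines
from pydantic import BaseModel


class Caption(BaseModel):
    description: str


# Text-only: plain outlines cannot consume image embeddings,
# which is why Moonline copies and adapts the relevant parts.
model = outlines.models.transformers("microsoft/phi-2")  # placeholder model
generator = outlines.generate.json(model, Caption)
result = generator("Describe a farm scene as JSON.")
```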
|
|
|
### How to use |
|
|
|
The best way to start is by cloning the repo and running `example.py`. Make sure to set up a virtual environment and install the dependencies from `requirements.txt`.
|
|
|
`example.py` runs through a simple example of generating a description and a mood for the farm image in `example.png`.
|
|
|
```python
from enum import Enum

from PIL import Image
from pydantic import BaseModel
from transformers import AutoTokenizer

from moonline import Moonline


def main():
    class Mood(Enum):
        sad = "sad"
        happy = "happy"
        angry = "angry"
        neutral = "neutral"

    class ExampleModel(BaseModel):
        description: str
        mood: Mood

    prompt = f"""
    Your job is to describe the image.
    Please answer in json with the following format: {ExampleModel.__annotations__}
    """

    image_path = "example.png"

    model_id = "vikhyatk/moondream2"
    revision = "2024-04-02"
    tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
    moonline = Moonline.from_pretrained(
        model_id,
        revision=revision,
    ).to()  # pass a device here, e.g. .to("cuda"), to run on GPU
    moonline.eval()

    # Encode the image once; the embeddings are passed to the text model.
    image = Image.open(image_path)
    image_embeds = moonline.encode_image(image)

    # Build the FSM that constrains decoding to the pydantic schema.
    fsm = moonline.generate_fsm(ExampleModel, tokenizer)

    answer = moonline.answer_question(image_embeds, prompt, tokenizer, fsm)
    print(f"answer: {answer}")


if __name__ == "__main__":
    main()
```
|
|
|
The result is something like this: |
|
|
|
```json
{
  "description": "A cartoon house is shown sitting on a dirt road with a long gravel path. Plants and trees surround the house. In the distance, there is a canal or pond with ducks swimming about. The scene is full of greenery, and flowers bloom among the vegetation. The sky is a clear blue, and a lush, verdant landscape can be spotted in the background. There is a pathway leading towards the house.",
  "mood": "happy"
}
```
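Since decoding was constrained to the schema, the raw string can be validated straight into the pydantic model (a short sketch reusing `ExampleModel` and `answer` from the example above; `model_validate_json` assumes pydantic v2):

```python
# answer is the raw JSON string returned by answer_question above
result = ExampleModel.model_validate_json(answer)
print(result.mood)  # e.g. Mood.happy
```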
|
|
|
### Limitations |
|
|
|
The model hallucinates, especially when the schema requests a field that has no counterpart in the image. This can be alleviated by allowing `None` values or adding guidance to the prompt, but in my experience this doesn't solve the issue fully.
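One way to express an "allow missing" field (a sketch modifying `ExampleModel` from the example above; whether this actually helps depends on the image and the prompt):

```python
from typing import Optional


class ExampleModel(BaseModel):
    description: str
    # null is now a valid output when no mood is recognizable,
    # so the constrained decoder isn't forced to pick one.
    mood: Optional[Mood] = None
```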
|
|
|
Moondream is also not specifically trained on JSON output. I expect results would improve by fine-tuning on JSON descriptions of images, especially ones where fields are missing.