---
license: apache-2.0
pipeline_tag: image-to-text
---
# Moonline
Moonline is a fork of [moondream2](https://huggingface.co./vikhyatk/moondream2). It combines moondream2's image-to-text generation with a modified version of
[outlines](https://github.com/outlines-dev/outlines), so that the generated text conforms to a given pydantic model.
## Model Details
The weights and the model structure come directly from moondream2. The difference is that the Phi text model is swapped for a Phi variant that
generates text according to a given structure. Since the outlines API doesn't operate directly on embeddings, only the relevant parts are
copied and modified.
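Conceptually, structured generation works by masking the model's logits at each decoding step so that only tokens allowed by a finite-state machine built from the schema can be sampled. The following is a minimal sketch of that idea with a toy vocabulary and a hand-written FSM, not the actual outlines internals:

```python
import math

# Toy vocabulary and a hand-built FSM mapping state -> allowed token ids.
# In Moonline, outlines derives such an FSM from the pydantic model's schema.
VOCAB = ["{", "}", '"sad"', '"happy"', ":", '"mood"']
FSM = {
    0: {0},     # start: only "{"
    1: {5},     # after "{": only the key '"mood"'
    2: {4},     # after the key: only ":"
    3: {2, 3},  # after ":": one of the enum values
    4: {1},     # after the value: only "}"
}

def constrained_step(logits, state):
    """Mask logits so only FSM-allowed tokens are sampleable, then take argmax."""
    allowed = FSM[state]
    masked = [l if i in allowed else -math.inf for i, l in enumerate(logits)]
    return max(range(len(masked)), key=lambda i: masked[i])

# Greedy decode with uniform (uninformative) logits: the FSM alone
# forces a syntactically valid output.
state, out = 0, []
for _ in range(5):
    tok = constrained_step([0.0] * len(VOCAB), state)
    out.append(VOCAB[tok])
    state += 1
print("".join(out))  # {"mood":"sad"}
```

The real FSM tracks partially-consumed tokens and regex-derived states, but the masking principle is the same: invalid continuations get probability zero, so the output is valid by construction.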
### How to use
The best way to start is by cloning the repo and running `example.py`.
Make sure to set up a virtual environment and install the dependencies from `requirements.txt`.
`example.py` runs through a simple example of generating a description and a mood for the farm image.
```python
from enum import Enum

from PIL import Image
from pydantic import BaseModel
from transformers import AutoTokenizer

from moonline import Moonline


def main():
    class Mood(Enum):
        sad = "sad"
        happy = "happy"
        angry = "angry"
        neutral = "neutral"

    class ExampleModel(BaseModel):
        description: str
        mood: Mood

    prompt = f"""
Your job is to describe the image.
Please answer in json with the following format: {ExampleModel.__annotations__}
"""

    image_path = "example.png"
    model_id = "vikhyatk/moondream2"
    revision = "2024-04-02"

    tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
    moonline = Moonline.from_pretrained(
        model_id,
        revision=revision,
    ).to()
    moonline.eval()

    image = Image.open(image_path)
    image_embeds = moonline.encode_image(image)

    fsm = moonline.generate_fsm(ExampleModel, tokenizer)
    answer = moonline.answer_question(image_embeds, prompt, tokenizer, fsm)
    print(f"answer: {answer}")


if __name__ == "__main__":
    main()
```
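Note that the prompt above interpolates `ExampleModel.__annotations__`, which renders raw Python type objects rather than a JSON description. An alternative worth trying (assuming pydantic v2; on v1 the method is `ExampleModel.schema()`) is `model_json_schema()`, which gives the model an explicit JSON Schema of the expected fields:

```python
from enum import Enum
from pydantic import BaseModel

class Mood(Enum):
    sad = "sad"
    happy = "happy"
    angry = "angry"
    neutral = "neutral"

class ExampleModel(BaseModel):
    description: str
    mood: Mood

# __annotations__ renders Python types, e.g.
# {'description': <class 'str'>, 'mood': <enum 'Mood'>}
print(ExampleModel.__annotations__)

# A JSON Schema is usually a clearer format hint for the model:
schema = ExampleModel.model_json_schema()
print(schema["properties"])
```

Since the FSM already guarantees the output structure, the prompt's format hint mainly helps the model pick sensible field values rather than enforce syntax.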
The result is something like this:
```json
{
"description": "A cartoon house is shown sitting on a dirt road with a long gravel path. Plants and trees surround the house. In the distance, there is a canal or pond with ducks swimming about. The scene is full of greenery, and flowers bloom among the vegetation. The sky is a clear blue, and a lush, verdant landscape can be spotted in the background. There is a pathway leading towards the house.",
"mood": "happy"
}
```
### Limitations
The model hallucinates, especially when the schema asks for a field that doesn't exist in the image.
This can be alleviated by offering `None` options or adding guidance to the prompt, but in my experience this doesn't fully solve the issue.
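One way to express the `None` option is to make the field optional in the pydantic model, so the schema (and the FSM built from it) permits `null` for fields that may be absent from the image. A minimal sketch, using pydantic v2 syntax and the field names from the example above:

```python
from enum import Enum
from typing import Optional
from pydantic import BaseModel

class Mood(Enum):
    sad = "sad"
    happy = "happy"
    angry = "angry"
    neutral = "neutral"

class ExampleModel(BaseModel):
    description: str
    # Optional with a default of None: the schema now allows "mood": null,
    # giving the model an explicit way to say "not visible in the image".
    mood: Optional[Mood] = None

# Output with a null field still validates:
answer = ExampleModel.model_validate_json('{"description": "a house", "mood": null}')
print(answer.mood)  # None
```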
Moondream is also not specifically trained to produce JSON output. I expect results would improve with fine-tuning on JSON descriptions of
images, especially on examples where fields are missing from the image.