---
license: apache-2.0
pipeline_tag: image-to-text
---
|
|
|
# Moonline |
|
|
|
Moonline is a fork of [moondream2](https://huggingface.co./vikhyatk/moondream2). It combines moondream2's image-to-text generation with a modified version of [outlines](https://github.com/outlines-dev/outlines), so that text is generated according to a specific pydantic model.
|
|
|
## Model Details |
|
|
|
The weights and the model structure come directly from moondream2. The difference is that the Phi text model is swapped for a Phi model that generates text according to a given structure. Since the outlines API doesn't work directly on embeddings, only the relevant parts are copied and modified.
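For context, this is roughly how plain outlines constrains a text-only transformer to a pydantic schema (a minimal sketch against the outlines 0.0.x-style API; the model name and prompt are placeholders, and Moonline reimplements these internals so that generation can start from image embeddings instead):

```python
import outlines
from pydantic import BaseModel


class Caption(BaseModel):
    description: str


# Text-only: plain outlines cannot consume image embeddings,
# which is why Moonline copies and adapts the relevant parts.
model = outlines.models.transformers("microsoft/phi-2")  # placeholder model
generator = outlines.generate.json(model, Caption)
result = generator("Describe a farm scene as JSON.")
```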
|
|
|
### How to use |
|
|
|
The best way to start is by cloning the repo and running `example.py`. Make sure to set up a virtual environment and install the dependencies from `requirements.txt`.
|
|
|
`example.py` runs through a simple example of generating a description and a mood for the farm image in `example.png`.
|
|
|
```python
from enum import Enum

from PIL import Image
from pydantic import BaseModel
from transformers import AutoTokenizer

from moonline import Moonline


def main():
    class Mood(Enum):
        sad = "sad"
        happy = "happy"
        angry = "angry"
        neutral = "neutral"

    class ExampleModel(BaseModel):
        description: str
        mood: Mood

    prompt = f"""
    Your job is to describe the image.
    Please answer in json with the following format: {ExampleModel.__annotations__}
    """

    image_path = "example.png"

    model_id = "vikhyatk/moondream2"
    revision = "2024-04-02"
    tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
    moonline = Moonline.from_pretrained(
        model_id,
        revision=revision,
    ).to()  # pass a device here, e.g. .to("cuda"), to run on GPU
    moonline.eval()

    # Encode the image once; the embeddings are passed to the text model.
    image = Image.open(image_path)
    image_embeds = moonline.encode_image(image)

    # Build the FSM that constrains decoding to the pydantic schema.
    fsm = moonline.generate_fsm(ExampleModel, tokenizer)

    answer = moonline.answer_question(image_embeds, prompt, tokenizer, fsm)
    print(f"answer: {answer}")


if __name__ == "__main__":
    main()
```
|
|
|
The result is something like this: |
|
|
|
```json
{
  "description": "A cartoon house is shown sitting on a dirt road with a long gravel path. Plants and trees surround the house. In the distance, there is a canal or pond with ducks swimming about. The scene is full of greenery, and flowers bloom among the vegetation. The sky is a clear blue, and a lush, verdant landscape can be spotted in the background. There is a pathway leading towards the house.",
  "mood": "happy"
}
```
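Since decoding was constrained to the schema, the raw string can be validated straight into the pydantic model (a short sketch reusing `ExampleModel` and `answer` from the example above; `model_validate_json` assumes pydantic v2):

```python
# answer is the raw JSON string returned by answer_question above
result = ExampleModel.model_validate_json(answer)
print(result.mood)  # e.g. Mood.happy
```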
|
|
|
### Limitations |
|
|
|
The model hallucinates, especially when the schema requests a field that has no counterpart in the image. This can be alleviated by allowing `None` values or adding guidance to the prompt, but in my experience this doesn't solve the issue fully.
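One way to express an "allow missing" field (a sketch modifying `ExampleModel` from the example above; whether this actually helps depends on the image and the prompt):

```python
from typing import Optional


class ExampleModel(BaseModel):
    description: str
    # null is now a valid output when no mood is recognizable,
    # so the constrained decoder isn't forced to pick one.
    mood: Optional[Mood] = None
```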
|
|
|
Moondream is also not specifically trained on JSON output. I expect results would improve by fine-tuning on JSON descriptions of images, especially ones where fields are missing.