Howdy Forumers,
Got another computer vision guide (maybe the last one for a short while). We are looking at Moondream, a fantastic Vision Language Model capable of running on the Pi. Essentially, it can analyse an image much like a human does, using context and the things in the scene to answer questions in natural language. Wickedly powerful tool, check it out: “Getting Started with Moondream on the Pi 5 | Human-Like Computer Vision”
Is it better than YOLO?
Great video and really impressive model. Thank you
Hey @ahsrab292840,
It fills a different role to YOLO. Moondream takes 8-20 seconds to analyse a single frame on a Pi, whereas YOLO can analyse up to 50 frames a second if you push it.
But the trade-off is that it sort of “thinks deeply” about the image and somewhat understands what's going on. This means you can ask it questions about the image like “Is the car parked in the grass?” or “Is there a suspicious person in this image?”. YOLO, on the other hand, is only capable of detecting objects.
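To make that concrete, here's a rough sketch of asking a question with the `moondream` Python package from the video. The weights filename and image are placeholders, and the exact calls are from memory rather than the guide itself, so double-check them against the package docs:

```python
import os

# Guarded so the sketch still runs on a machine without the package or files.
try:
    import moondream as md
    from PIL import Image
    ready = os.path.exists("moondream-2b-int8.mf") and os.path.exists("driveway.jpg")
except ImportError:
    ready = False  # `moondream` / Pillow not installed here

if ready:
    model = md.vl(model="moondream-2b-int8.mf")  # placeholder weights file
    image = Image.open("driveway.jpg")           # placeholder test image
    # A natural-language question -- something YOLO has no concept of
    print(model.query(image, "Is the car parked in the grass?")["answer"])
```

Expect each call like this to take that 8-20 second window on a Pi 5.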
Hi Jaryd,
I followed your video/instructions, and Moondream works well on my RPI 5. The only problem I have faced is that the response time (i.e., to describe an image) is too long (I am using the 2B model). Can I speed up the response time using my Hailo-AI NPU or other accelerators? Many thanks, as always, and have a wonderful day!
Hey @Youngtae306064,
I am doubtful that Moondream can run on the AI HAT. It has a completely different set of internal workings and I wouldn’t even know where to start in terms of trying to port it over - a real curveball.
It may not be an ideal solution, but Moondream currently offers a generous free tier for its online cloud servers, which should get the response time down quite a lot. May change in the future, but you can have up to 5000 images processed per day which is plenty for most projects.
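For a sense of scale, 5000 images a day works out to roughly one frame every 17 seconds sustained around the clock, which is why it covers most hobby projects:

```python
# Back-of-envelope: slowest capture interval that stays inside the free tier
daily_quota = 5000             # images per day on the free tier
seconds_per_day = 24 * 60 * 60

min_interval = seconds_per_day / daily_quota
print(f"one image every {min_interval:.2f} s, sustained")  # 17.28 s
```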
If you want something more local, you could also set up your own Moondream server on another computer.
In terms of using a Moondream server, the package we installed has some instructions on getting started.
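As a rough illustration, a client on the Pi could talk to a server like that over plain HTTP with nothing but the standard library. The LAN address, port, endpoint path, and payload shape below are all assumptions on my part - check them against the package's own instructions before relying on this:

```python
import base64
import json
from urllib import request

# Hypothetical address of a Moondream server on another machine on your LAN
ENDPOINT = "http://192.168.1.50:2020/v1/query"

def encode_image(jpeg_bytes: bytes) -> str:
    """Pack raw JPEG bytes into a base64 data URL."""
    return "data:image/jpeg;base64," + base64.b64encode(jpeg_bytes).decode()

def ask(jpeg_bytes: bytes, question: str) -> str:
    """POST a question about an image and return the server's answer."""
    payload = json.dumps(
        {"image_url": encode_image(jpeg_bytes), "question": question}
    ).encode()
    req = request.Request(
        ENDPOINT, data=payload, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["answer"]
```

The Pi then only does camera capture and playback, and the heavy lifting happens on whatever box runs the server.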
Hope this helps!
Dear Jaryd,
Thank you so much for your prompt response. I am currently developing AI edge devices for individuals with low vision and elder care.
Thanks to your videos and detailed instructions, I have successfully created glasses using an AI camera and a patient monitoring system using Hailo-AI pose-estimation.
I thought Moondream would run on the RPI 5 (as demonstrated in your recent video), but its response is quite slow, especially with the 2B model.
Do you have any suggestions on how I can speed up its response time on the RPI? I would like to avoid relying on Moondream’s online cloud servers, as I do not expect low-vision users to make use of its generous free tier.
Thank you again, and have a wonderful day!
Sincerely,
Young-tae Kim
Hey @Youngtae306064,
The Moondream model already pushes the Pi 5 to its limits. The only way to shorten response times is to shorten the output message, as outlined in the video. Vision language models are still in their infancy and not efficient enough to run on a Pi 5 in real-time - or anything close to it! Maybe a future model will come out that is efficient enough to run on the Pi 5, but for now we are at a bit of a dead end.
I had a bit more of a look at even the possibility of running Moondream on the AI HAT. Long story short, the hardware was designed for models like YOLO and not things like Moondream or LLMs.
It may not be what you are after, but another option (if you haven’t already seen it) is YOLOE. Essentially, it’s a YOLO object detection model that can recognise new things on the fly without additional training. It is a bit hit-and-miss and takes some work to get reliable, but it’s one of the last options I can think of that might help.
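For reference, prompting YOLOE with new class names looks roughly like this with the Ultralytics package. The weights file and class names are just examples (I've picked ones relevant to an accessibility project), and the exact calls may have shifted between releases, so check the Ultralytics docs:

```python
import os

# Guarded so the sketch runs even without the package or a test image.
try:
    from ultralytics import YOLOE  # pip install ultralytics
    have_yoloe = True
except ImportError:
    have_yoloe = False

if have_yoloe and os.path.exists("street.jpg"):
    model = YOLOE("yoloe-11s-seg.pt")  # example weights; downloads on first run
    names = ["white cane", "guide dog", "wheelchair"]
    # Point the detector at new classes via text prompts -- no retraining needed
    model.set_classes(names, model.get_text_pe(names))
    results = model.predict("street.jpg")  # your own test image
    print(len(results[0].boxes), "detections")
```

Because it is still a YOLO-family model under the hood, it keeps YOLO-like frame rates rather than Moondream's seconds-per-frame.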
Sounds like an awesome project as well, but I think you may be limited by the technology of our time. As far as I’m aware, even smart glasses made by billion-dollar companies haven’t figured out how to run everything on the edge - most of the processing is done via cloud services. Some simple tasks are processed locally (like if you said “take a photo”), but the tasks I think you are trying to do are done via the cloud. We are getting very close to running these on the edge though, so it may only be a couple of years or so before we can!
Cheers,
Jaryd
Dear Jaryd,
Thank you very much for your clear guidance on my journey of developing AI edge devices for people with low vision and for elder care.
I will continue working on YOLOE (I also learned about YOLO models on the RPI via your guidance) and will try to shorten the output message.
I sincerely hope that we can develop a local AI edge device capable of providing people with low vision and the elderly with fast, always-on, and detailed environmental information via sound.
Thank you again, and have a wonderful day!
Sincerely,
Young-tae Kim
No worries, best of luck!
If you do find anything interesting or a way to speed this up, please do let us know!
Good luck with the project!
