I tried to run florence-2 and colpali using the Huggingface serverless inference API. Searching around, there seems to pretty pretty start support for image-text-to-text models. On Github, I only found a few projects that even reference these types of models.

I didn’t really know what I was doing, so I copied the example code then tried to use a model to augment it to call florence-2. Initially, it seemed like it was working:

❯ python run.py
{'error': 'Model microsoft/Florence-2-large is currently loading', 'estimated_time': 61.724300384521484}

After the time passed, there seemed to be a problem:

❯ python run.py
{'error': 'image-text-to-text is not a valid pipeline'}

When trying out vidore/colpali, there seems to be an architectural incompatibility that prevents the serverless inference API from even loading the model

❯ python run.py
{'error': 'HfApiJson(Deserialize(Error("unknown variant `colpali`, expected one of `text-generation-inference`, `transformers`, `allennlp`, `flair`, `espnet`, `asteroid`, `speechbrain`, `timm`, `sentence-transformers`, `spacy`, `sklearn`, `stanza`, `adapter-transformers`, `fasttext`, `fairseq`, `pyannote-audio`, `doctr`, `nemo`, `fastai`, `k2`, `diffusers`, `paddlenlp`, `mindspore`, `open_clip`, `span-marker`, `bertopic`, `peft`, `setfit`", line: 1, column: 411)))'}

Overall, not a great success. Despite being more worked on an available than ever, models continue to be challenging to deploy for a layperson like me. Without an API, I’ve generally not made a ton of progress using large, state of the art models.