Authors: Dongyang Liu, Renrui Zhang, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, Kaipeng Zhang, Wenqi Shao, Chao Xu, Conghui He, Junjun He, Hao Shao, Pan Lu, Hongsheng Li, Yu Qiao, Peng Gao
Paper Content:
Page 1:
SPHINX-X: Scaling Data and Parameters for a Family of
Multi-modal Large Language Models
Dongyang Liu* 1 2Renrui Zhang* 1 2Longtian Qiu* 2Siyuan Huang* 2Weifeng Lin* 2Shitian Zhao2
Shijie Geng3Ziyi Lin1 2Peng Jin2Kaipeng Zhang2Wenqi Shao2Chao Xu2Conghui He2
Junjun He2Hao Shao1Pan Lu4Yu Qiao†2Hongsheng Li†1 5Peng Gao† ‡* 2
Abstract
We propose SPHINX-X, an extensive Multi-
modality Large Language Model (MLLM) se-
ries developed upon SPHINX. To improve the
architecture and training efficiency, we modify the
SPHINX framework by removing redundant vi-
sual encoders, bypassing fully-padded sub-images
with skip tokens, and simplifying multi-stage
training into a one-stage all-in-one paradigm. To
fully unleash the potential of MLLMs, we as-
semble a comprehensive multi-domain and multi-
modal dataset covering publicly available re-
sources in language, vision, and vision-language
tasks. We further enrich this collection with our
curated OCR intensive and Set-of-Mark datasets,
extending the diversity and generality. By training
over different base LLMs including TinyLlama-
1.1B, InternLM2-7B, LLaMA2-13B, and Mixtral-
8×7B, we obtain a spectrum of MLLMs that vary
in parameter size and multilingual capabilities.
Comprehensive benchmarking reveals a strong
correlation between the multi-modal performance
with the data and parameter scales. Code and mod-
els are released at https://github.com/
Alpha-VLLM/LLaMA2-Accessory .
1. Introduction
Since the release of OpenAI’s GPT-4 (V) (OpenAI, 2023)
and Google’s Gemini (Gemini Team, 2023), Multi-modal
Large Language Models (MLLMs) have become an increas-
ingly popular research area (Fu et al., 2023c; Yang et al.,
∗Equal Contribution†Corresponding Authors‡Project
Lead1MMLab, CUHK2Shanghai AI Laboratory3Rutgers
University4University of California, Los Angeles5Centre
for Perceptual and Interactive Intelligence (CPII). Correspon-
dence to: Peng Gao <gaopeng@pjlab.org.cn >, Hongsheng Li
<hsli@ee.cuhk.edu.hk >, Yu Qiao <qiaoyu@pjlab.org.cn >.
Proceedings of the 41stInternational Conference on Machine
Learning , Vienna, Austria. PMLR 235, 2024. Copyright 2024 by
the author(s).
Average PerformanceParameter Size𝟏𝐁𝟐𝟏𝟎𝟎𝟐𝟒𝟎𝟎SPHINX-Tiny𝟐𝟎𝟎𝟎𝟐𝟑𝟎𝟎𝟐𝟐𝟎𝟎𝟕𝐁𝟏𝟑𝐁𝟖×𝟕𝐁SPHINX-XFamilySPHINX-Intern2SPHINXSPHINX-PlusSPHINX-MoEDataScaling-up𝟐𝟓𝟎𝟎𝟐𝟔𝟎𝟎Figure 1. Performance comparison with data and parameter
scaling. We introduce SPHINX-X, a general and well-performing
MLLM family developed upon SPHINX (Lin et al., 2023).
2023e). By aligning multi-modal encoders with Large Lan-
guage Models (LLMs), MLLMs demonstrate the potential
to unlock myriad novel applications and further push the
boundary of next-level artificial general intelligence, span-
ning from embodied intelligence (Geng et al., 2023), au-
tonomous driving (Wen et al., 2023; Cao et al., 2023; Yang
et al., 2023d) to graphical user interfaces (GUI) agents (He
et al., 2024; Yang et al., 2023f).
Inspired by this, a wide array of open-source MLLMs
have been developed within merely one year, including
BLIP series (Li et al., 2023b; Dai et al., 2023), LLaMA-
Adapter (Zhang et al., 2024c; Gao et al., 2023), LLaV A (Liu
et al., 2023b;a; Li et al., 2024a;b), MiniGPT-4 (Zhu et al.,
2023a), mPLUG-Owl (Ye et al., 2023b;c), SPHINX (Lin
et al., 2023), and the visual mathematical MA VIS (Zhang
et al., 2024e). Although these open-source MLLMs demon-
strate promising multi-modal capabilities, their performance
is still constrained by the training data from few task do-
mains and limited choices of LLM parameters:
Limited Data Coverage for Tasks. Popular open-source
MLLMs, such as BLIP-2, LLaV A, and LLaMA-Adapter, are
typically trained on raw vision-language data from the nat-
ural image domain (e.g., LAION (Schuhmann et al., 2021;
2022), SBU (Ordonez et al., 2011), and Conceptual Cap-
1arXiv:2402.05935v2 [cs.CV] 26 Jun 2024
Page 2:
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
SPHINX-X:这张图片捕捉到了一幅令人惊叹的景象…。这道波浪是图片的主要焦点,它被涂成了一种鲜艳的蓝色,与深蓝色的背景形成了鲜明的对比。…在波浪的右侧,可以看到一座山,它的轮廓在深蓝色的天空中显得格外突出。…波浪和山的相对位置表明波浪正在向山冲来,暗示着一种运动和能量。…Bilingual Multi-modal UnderstandingSPHINX-X:这张图与日本画家葛饰北斋的画作《神奈川冲浪里》非常相似。这幅画作描绘了日本海岸附近的一场大浪,与这张图所展示的波浪有着相似的动态和规模。
SPHINX-X:import matplotlib.pyplot as plt# Define the datadata = {'Human': 95.7,'Gemini': 40.7,'GPT-4V': 38.7,'Random Guess': 25.0,'LLaVA-1.5': 24.7,'Bard': 19.0,'Bing Chat': 17.3,'InstructBLIP': 16.7,'mini-GPT4': 12.7,'LLaVA': 6.0}User:Write a python code to generate this image.Multi-modal Programming# Create the bar plotplt.bar(data.keys(), data.values())# Add labels and titleplt.xlabel('Model')plt.ylabel('Accuracy (%)')plt.title('Accuracy of Models')# Show the plotplt.show()
Geometric ReasoningUser:Please detect all texts and provide their bounding box coordinates.Text Spotting
User: Add a new city.GUI Grounding
User:请用框框检测出来山的位置。SPHINX-X:山[0.46, 0.49, 0.71, 0.63]User:这张图跟日本哪个画家的画类似?User:这请详细描述这张图。SPHINX-X:There is a tallest building [0.47, 0.44, 0.49, 0.50]in the image.User:Detect the tallest building in the image.High-resolution Detection
Pose EstimationUser:Detect the key points of the horse and person in the image.SPHINX-X:The detected key points are horse keypoints[0.45,0.26, … , 0.24,0.68], and human keypoints[0.64,0.32, …, 0.68,0.74]SPHINX-X:[0.05, 0.05, 0.1, 0.12]
User: Back to previous page.SPHINX-X:[0.11, 0.12, 0.14, 0.15]
User: Adjust the volume.SPHINX-X:[0.88, 0.90, 0.90, 0.92]
SPHINX-X:4. If the line m is parallel to the side AB of ?ABC, what is angle a? a. 130°b. 25°c. 65°d. 50User:What's the question?SPHINX-X:Since line m is parallel to side AB. then angle ais congruent to angle A. Since angle A is 50 degrees, then angle ais 50 degrees.User:What's the answer?SPHINX-X: d.User:Perform reasoning then answer.
BILLGATESTHEROADAHEAD
Figure 2. Demonstrations of SPHINX-X. With our proposed data and training strategies, SPHINX-X can achieve superior multi-modal
understanding and reasoning capabilities in diverse domains, e.g., bilingual, serving as a multi-purpose vision generalist.
tions (Sharma et al., 2018)), and visual instruction-following
data (Chen et al., 2023b; Wang et al., 2023a) generated by
GPT-4 (V) (OpenAI, 2023). As a result, they normally ex-
hibit a superior multi-modal understanding performance in
natural images. However, they display limited or degraded
results in out-of-domain scenarios, such as Optical Charac-
ter Recognition (OCR), table, chart, and mathematics fields,
where in-depth domain-specific knowledge is critical. In
contrast, domain-specific MLLMs like Shikra (Chen et al.,
2023a), mPLUG-DocOwl (Ye et al., 2023a), and Kosmos-
2.5 (Lv et al., 2023) are tailored to excel in specific tasks,
but at the expense of their general multi-modal capabilities.
Limited Choices of Model Parameters. Most open-source
MLLMs are developed on top of dense LLMs, e.g.,
LLaMA (Touvron et al., 2023a;b), with 7B or 13B param-
eters. While such parameter counts are often prohibitively
large for deployment on portable devices, the same number
of parameters remains inadequate to fully explore the perfor-
mance boundaries of MLLMs. Therefore, scaling down the
model scale of MLLMs could facilitate the broader adoption
of mobile devices. Meanwhile, scaling up the parameter
count through the integration of sparsely-activated Mixture-of-Experts (MoE) architecture (Shazeer et al., 2017) could
also unlock the full potential of MLLMs in addressing com-
plex real-world multi-modal challenges.
To resolve the aforementioned limitations of existing
MLLMs, we introduce a family of MLLMs termed
SPHINX-X by extending the data coverage of tasks and
parameter scales in SPHINX, as shown in Figure 1. The
superior multi-modal generalization capacity of SPHINX-X
for a diversity of tasks is exhibited in Figure 2. Importantly,
we adjust the training process and model architecture of
SPHINX to better accommodate the efficient and large-scale
multi-modal training:
➀Modifications over SPHINX. For the mixed four vision
encoders in SPHINX, we only preserve two of them, i.e.,
CLIP-ConvNeXt (Liu et al., 2022) and DINOv2 (Oquab
et al., 2023). Considering their distinct methodologies and
architectures, the two encoders can provide the most com-
plementary visual semantics, denoted as Mixture ofVisual
experts ( MoV ). Then, for the sub-image division strategy
of high-resolution images, if the input image has a large as-
pect ratio, we observe a frequent occurrence of fully-padded
2
Page 3:
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
sub-images, where all pixels are zeros. To address this,
we adopt a learnable skip token to represent them within
LLMs, thereby shortening the sequence length for efficiency,
while still preserving the relative positions of sub-images.
Furthermore, given the increased training data volume, we
condense the previous multi-stage training pipeline into a
more straightforward single-stage paradigm. Instead of fine-
tuning different parts of LLM parameters in two stages with
different datasets, we directly train all the parameters of
LLMs on all our collected datasets.
➁Multi-Domain and Multi-Modal Datasets. To fully un-
leash the potential of MLLMs, we assemble an extensive
collection of public datasets that span a wide array of tasks,
and carefully extend two self-curated multi-modal datasets.
In detail, we collect the public datasets from the realms of
vision, language, and vision-language tasks, and reformu-
late them into a unified multi-turn conversational format.
Moreover, to specifically enhance the targeted capacity of
MLLMs, we further construct an OCR-intensive dataset
and a Set-of-Mark (SoM) dataset. The expansion of OCR
data processed from substantial PDFs can unlock the visual
language understanding power of MLLMs, e.g., text spot-
ting and document layout detection. The specialized SoM
data also compensates for the SoM prompting (Yang et al.,
2023c) potentials of SPHINX-X, for which we construct
delicate SoM annotations in diverse domains by GPT-4.
➂LLM Parameter Scaling of SPHINX-X. With the afore-
mentioned techniques and large-scale datasets, we marry
SPHINX-X with various base LLMs of increasing parameter
scales: TinyLlama-1.1B (Zhang et al., 2024b), InternLM2-
7B (Team, 2023), LLaMA2-13B (Touvron et al., 2023b),
and Mixtral-8 ×7B (Jiang et al., 2024a). Respectively, we de-
velop a family of MLLMs that facilitate fast mobile deploy-
ment (SPHINX-Tiny), provide bilingual support (SPHINX-
Intern2), possess moderate parameters with data scaling
(SPHINX-Plus), and exhibit strong reasoning capabilities
through Mixture-of-Expert architectures (SPHINX-MoE).
Extensive evaluations across a wide range of benchmarks
reveal that SPHINX-Plus surpasses the original SPHINX,
confirming that enriching dataset scales and diversity can
benefit the performance. Furthermore, a comparison of base
LLMs from 1.1B to 7 ×8B demonstrates that under the same
training pipeline, scaling up the parameters can consistently
boost the multi-modal understanding capabilities. Overall,
we summarize the key contributions as follows:
•We release a family of well-performing MLLMs tailored
from fast inference on mobile devices to complex reason-
ing tasks on high-end computers. A comprehensive range
of experiments demonstrates that the scale of training data
and the size of LLM parameters both play a critical role
in the performance of MLLMs.
•We perform several modifications over SPHINX by elimi-nating redundant visual encoders, avoiding fully-padded
sub-images with learnable skip tokens, as well as stream-
lining the complex multi-stage training pipeline into a
single-stage all-in-one paradigm.
•We collected an extensive multi-modal dataset covering a
broad spectrum of tasks and modalities. On top of that, we
curated two new datasets for enhancing the OCR-intensive
and Set-of-Marks prompting capabilities of MLLMs.
2. Related Work
Large Language Models (LLMs) Advancements in
recent MLLM research are based on the breakthrough
of LLMs constructed upon the Transformer architec-
ture (Vaswani et al., 2017), where progress has stemmed
from both an expansion of training data and a significant
increase in model parameters. For instance, GPT-3 (Brown
et al., 2020), boasting 175B parameters, excels at few-shot
in-context learning, while GPT-2 (Radford et al., 2019) with
1.5B parameters falls short of reaching this level of per-
formance. Inspired by GPT-3’s success, several LLMs
like PaLM (Chowdhery et al., 2022), OPT (Zhang et al.,
2022b), BLOOM (Workshop et al., 2022), and LLaMA have
emerged. Mistral (Jiang et al., 2023) further introduced win-
dow attention for enhanced long-context modeling, while
Mixtral 8 ×7B leveraged sparse MoE layers (Fedus et al.,
2022; Lepikhin et al., 2020; Shazeer et al., 2017) to upscale
parameters efficiently, outperforming with fewer active pa-
rameters. Concurrently, models such as Qwen (Bai et al.,
2023), Baichuan (Yang et al., 2023a), and InternLM (Team,
2023) have advanced bilingual LLM capabilities, whereas
TinyLlama (Zhang et al., 2024b) and Phi-2 (Microsoft,
2023) focus on reducing parameters for edge deployment.
Our SPHINX family extends LLMs to multimodal domains
for visual understanding and reasoning. We select four
LLMs with different pre-training and parameter scales, com-
paring their performance under multi-modal scenarios.
Multi-modal Large Language Models (MLLMs) Ef-
forts to extend LLMs to perceive beyond text have birthed
MLLMs, with vision as the primary modality. Represen-
tative architectures include BLIP series (Li et al., 2022;
2023b) and MiniGPT-4 (Zhu et al., 2023a), which em-
ploy query Transformers to summarize visual features;
Flamingo (Alayrac et al., 2022), which uses gated cross-
attention to support interleaved image-text inputs; The
LLaMA-Adapter series (Zhang et al., 2024c; Gao et al.,
2023) that introduce zero-initialized attention to mini-
mize interference between visual and language tokens;
and LLaV A (Liu et al., 2023b;a), which connects visual
tokens to LLMs with a simple linear layer to incorpo-
rate visual knowledge. There are also recent advances in
MLLMs that have demonstrated remarkable extended capa-
bilities. For example, Shikra (Chen et al., 2023a) and Vision-
3
Page 4:
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
LLM (Wang et al., 2023b) excel in referring object detec-
tion, while ChartAssistant (Meng et al., 2024) and mPLUG-
DocOwl/PaperOwl (Ye et al., 2023a; Hu et al., 2023) fo-
cus on tables, documents, and scientific diagrams analysis.
LLaV A-NeXT-Interleave (Li et al., 2024b) targets multi-
image interleaved instruction tuning, and MA VIS (Zhang
et al., 2024e) curates large-scale visual mathematical tun-
ing datasets for challenging problem solving. Many efforts
also extend LLMs into more modalities, such as ImageBind-
LLM (Han et al., 2023a), Point-LLM (Guo et al., 2023),
and others (Zhu et al., 2023b; Zhang et al., 2022a; 2023b).
In this paper, we upgrade SPHINX (Lin et al., 2023) to an
MLLM family for more general visual instruction following,
achieving superior performance over various benchmarks.
3. Method
We first revisit the design principles of SPHINX in Sec-
tion 3.1. We then respectively detail the three improvements
made to SPHINX-X in Section 3.2 concerning the succinct-
ness of visual encoders, learnable skip tokens for useless
visual signals, and simplified one-stage training. Lastly, we
illustrate the composition of our large-scale multi-modality
dataset in Section 3.3, as well as introduce different base
LLMs adopted by the SPHINX-X family in Section 3.4.
3.1. A Revisit of SPHINX
SPHINX (Lin et al., 2023) proposes three types of mix-
ing strategies to develop a multi-purpose MLLM – mixing
of model weights, tuning tasks, and visual embeddings.
Following previous efforts (Gao et al., 2023; Liu et al.,
2023b), SPHINX adopts a two-stage training pipeline, in
which the first stage aligns pre-trained vision encoders with
LLaMA2 (Touvron et al., 2023b), and the second stage
integrates a variety of tasks for instruction tuning. For
more robust visual representations, SPHINX incorporates
the embeddings of four different vision encoders, including
CLIP-ViT (Radford et al., 2021; Dosovitskiy et al., 2020),
CLIP-ConvNeXt (Liu et al., 2022), DINOv2 (Oquab et al.,
2023), and Q-former (Li et al., 2023c). SPHINX then intro-
duces a multi-scale mixing strategy to tackle high-resolution
images, which divides the high-resolution input into several
sub-images along with a downsampled image for concurrent
encoding. In addition, to further mix various domain seman-
tics, SPHINX fuses the first-stage weights of LLMs that are
tuned by different data domains. Despite its superior per-
formance, SPHINX is still constrained by the cumbersome
two-stage training process and mixed architectures, and it
has yet to fully capitalize on the potential benefits of data
and model scaling. Motivated by this, we develop SPHINX-
X, an extensive series of MLLMs to explore a more general
and comprehensive multi-modal learning paradigm.3.2. SPHINX-X
To better handle large-scale multi-task and multi-modal
instruction-tuning, we perform the following improve-
ments over SPHINX-X, enabling the training pipeline and
model architecture to be concise. We present the upgraded
SPHINX-X training pipeline in Figure 3.
Eliminating Redundant Visual Encoders. SPHINX em-
ploys four complementary vision encoders to capture di-
verse visual representations. Although the mixture of visual
experts can improve the performance, it inevitably leads to
a significant increase in computational costs, especially for
a group of sub-images generated from a high-resolution in-
put. To obtain better computational efficiency, we eliminate
the CLIP-ViT and Q-former encoders, only preserving two
visual encoders – DINOv2 and CLIP-ConvNeXt. As these
two models are pre-trained by distinct learning approaches
(self-supervised vs. weakly-supervised) and network archi-
tectures (ViT vs. CNN), they can already provide the most
complementary and refined visual knowledge. We denote
them as the Mixture ofVisual experts ( MoV ).
Bypassing Fully-padded Sub-images with Skip Tokens.
The superior performance of SPHINX can be attributed to
its effective handling of high-resolution images with several
local sub-images and one global downsampled image. Dur-
ing the training stages of SPHINX, all images by default are
scaled and zero-padded to a high resolution 448×448, and
then divided into four 224×224sub-images. However, for
images with large aspect ratios, say 2 : 1, this operation will
result in fully-padded sub-images filled entirely with zero-
value pixels. Such fully-padded sub-images not only contain
noisy and useless visual signals, but also produce spare vi-
sual tokens that waste computational resources within both
MoV and LLM. To alleviate the issue, we propose a learn-
able skip token to replace the fully-padded sub-image, which
provides explicit relative positional information for LLMs
to identify the positions between useful sub-images. In this
way, MoV can avoid encoding these zero-pixel sub-images,
which allows for a reduction in the input sequence length
for LLMs, achieving enhanced computational efficiency.
One-Stage All-in-One Training. The original training
pipeline of SPHINX comprises two stages and utilizes a
weight mixing strategy. However, it requires to manually as-
sign various tunable parameters and dataset combinations to
different training stages, which is a labor-intensive task. To
simplify the overall paradigm, we design a single-stage all-
in-one training pipeline, which treats all collected datasets
equally and uniformly transforms them into multi-modal
multi-turn dialog formats. During the one-stage training,
we unfreeze all the parameters of SPHINX (i.e., LLM and
intermediate projection layers) except for the two visual en-
4
Page 5:
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
TinyLlama / InternLM2 / LLaMA2 / Mixtral
MoV MoV MoV
Fully -padded
Sub-imageZero
Padding
Region
Fully -padded
Sub-image
Sub-images Low-resolution ImageDownsampleDivide
High-resolution ImageSkip
TokenSkip
TokenA squirrel is standing on a
A squirrel is standing on a tomato
Unfreeze FreezeSPHINX -X Family
Figure 3. Overall paradigm of SPHINX-X family. On top of SPHINX (Lin et al., 2023), we adopt three modifications for a more general
and concise architecture: removing redundant visual encoders in Mixture of Visual Experts (MoV), bypassing fully-padded sub-images
with skip tokens, and simplifying multi-stage training into a one-stage all-in-one approach.
coders in MoV . Due to the large volume of training data and
high reasoning capacity of LLMs, our one-stage all-in-one
strategy can significantly streamline the training procedure
for MLLMs while maintaining high performance.
3.3. Training Data of SPHINX-X
To obtain remarkable multi-modal capabilities, we widely
convert three categories of public training data into
instruction-following formats (language, vision, and vision-
language), and carefully curate two targeted multi-modal
datasets (OCR-intensive and Set-of-Mark) for SPHINX.
All data is combined for the one-stage all-in-one training.
Specifically, for natural language data, we utilized datasets
that include multi-turn dialog, mathematical reasoning, and
code generation. For vision data, we convert data from
diverse computer vision tasks including image-level and
object-level understanding into multi-turn conversation for-
mats. For vision-language data, we collect various visual
question-answering, visual instruct-tuning, and fine-grained
image captioning datasets. On top of this, we generate an
OCR dataset from large-scale PDF data, and a multi-domain
Set-of-Marks dataset with fine-grained multi-modal knowl-
edge.
For the three parts that are mainly composed of exist-
ing datasets (namely language instruction-following, vi-sual instruction-following and vision-language instruction-
following ), we defer the details to the appendix (Sec. A.3).
The data statistic information is also provided in the ap-
pendix (Table 9).
OCR-intensive Data. Most previous MLLMs can only
leverage external tools and pre-extracted OCR tokens to
obtain satisfactory OCR-related understanding. To enhance
such capabilities for MLLMs, we compile an OCR-intensive
dataset from extensive Internet PDF data. Different from
previous synthetic OCR data (Yim et al., 2021; Kim et al.,
2021) that are too simple and far from real-world appli-
cation, our dataset is more challenging and larger-scale.
Specifically, we first collect large-scale PDF datasets from
Common Crawl1and arXiv websites. Then, we utilize
PyMuPDF2to get the rendering results of each page in
the PDF file and also save all the text annotations along
with their bounding boxes. To ensure the OCR quality,
we adopt multiple processing methods, including Unicode
characteristic checking, text splits merge, etc. In this way,
we constructed an in-house PaperText dataset with about
3M text-dense pages. Finally, we transform them into a
unified question-answering format to strengthen the OCR
documentation understanding ability.
1Common Crawl:https://commoncrawl.org/
2PyMuPDF: https://github.com/pymupdf/PyMuPDF
5
Page 6:
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Multi-Domain Set-of-Mark Data. We notice that existing
multi-modal datasets lack the fine-grained correspondence
between images and texts. Thus, we construct a multi-
domain dataset similar to Set-of-Marks techniques (Yang
et al., 2023c) to endow MLLMs with dense multi-modal
captioning knowledge. Initially, we collect diverse image
datasets from various domains. Then, we utilize dataset an-
notations such as bounding boxes and object masks to place
various marks like points, boxes, polygons, and identifiers,
on the raw images. After that, we craft domain-specific
instructions for each data type, and prompt GPT-4V with
the masked images for multi-scale captioning, which gener-
ates captions of global image understanding, detailed region
captioning, and object-relation analysis. Such SoM prompt-
ing for GPT-4V can motivate its power to produce higher-
quality and fine-grained multi-modal data. During training,
we do not utilize the marked images, but the raw images,
and describe the marks by language within the multi-turn
conversations for uniformity with other data domains.
3.4. SPHINX-X with Different LLMs
Built upon the aforementioned techniques and large-scale
datasets, we provide four choices of base LLMs in
SPHINX-X with increasing parameter scales: TinyLlama-
1.1B (Zhang et al., 2024a), InternLM2-7B (Team, 2023),
LLaMA2-13B (Touvron et al., 2023b), and Mixtral-
8×7B (Jiang et al., 2024a). We introduce their features
compared to the original SPHINX with LLaMA2-13B.
SPHINX-Tiny with TinyLlama-1.1B. TinyLlama can be
regarded as a lightweight version of LLaMA. The compact-
ness of 1.1B parameters allows TinyLlama to apply to a
diversity of scenarios with limited computation resources.
Therefore, we train SPHINX-Tiny to observe how the multi-
modal performance varies given the smaller-scale LLM.
SPHINX-MoE with Mixtral-8 ×7B. As a sparse Mixture-of-
Experts (MoE) LLM, Mixtral-8 ×7B utilizes 8 feed-forward
networks at each transformer layer as experts, and relies on
a router network to activate two experts each time. With this
sparse mechanism, we expect to analyze the characteristics
of different experts for multi-modal instruction following.
SPHINX-Plus with LLaMA2-13B. SPHINX-Plus utilizes
the same scaled LLaMA, i.e., 13B parameters, with the
original SPHINX, but is tuned by our constructed multi-
modal dataset (more diverse and larger-scale) with one stage.
This to some extent can illustrate the efficacy of data scaling-
up. Note that, referring to SPHINX-2K (Lin et al., 2023), we
also perform an improved version, termed SPHINX-Plus-
2K, which increases the image resolution from 448×448
to672×672and splits the input to 3×3sub-images for
fine-grained visual understanding.
SPHINX-Intern2 with InternLM2-7B. InternLM (Team,2023) is a strong bilingual LLM pre-trained on large Chinese
and English corpus. The recently released InternLM2-7B
shows stronger bilingual language understanding ability.
We adopt InternLM2-7B as the base LLM to explore the
performance pattern under the regular 7B-parameter LLM
setup and the potential of bilingual multi-modal reasoning.
4. Experiment
4.1. Experimental Settings
All SPHINX-X models presented in the paper follow the
one-stage all-in-one training strategy, and all modules ex-
cept the visual encoders are optimized. The learning rate is
set to 5e-6 for SPHINX-MoE, and 2e-5 for others. During
training, the learning rate first linearly warmups to the target
value within the first 0.01 epoch, and then gradually decays
to 0 following the cosine schedule. We use the AdamW
optimizer with weight decay = 0and betas = (0.9,0.95).
To accommodate the large model volume, a combination
of ZeRO2-style (Rajbhandari et al., 2020) data parallel and
Megatron-style (Shoeybi et al., 2019) model parallel is used.
The model parallel size is set to 8 for SPHINX-MoE, 2
for SPHINX-Plus, and 1 for others. The effective batch
size is 256. Note that SPHINX-Plus is initialized from the
SPHINX model, while other SPHINX-X models use the
original visual encoders/LLMs and randomly initialized lin-
ear projection layers. In Figure 1, we report the cumulative
scores of SPHINX-X models over benchmarks mentioned
in Tables 1, 4, and 5. However, due to the excessively
high overall score, we excluded MME (Fu et al., 2023b)
perception to balance the results.
4.2. Performance Evaluation
In this section, we conduct a thorough assessment and
present outcomes across various benchmarks, offering an
extensive overview and evaluation of our SPHINX-X family.
MLLM Benchmarks. We evaluate SPHINX-X on re-
cently introduced benchmarks, such as MME (Fu et al.,
2023a), Seedbench (Li et al., 2023a), POPE (Li et al.,
2023f), LLaV A-Bench (In-the-Wild) (Liu et al., 2023b),
MM-Vet (Yu et al., 2023b), MathVista (Lu et al., 2023), MM-
bench (Liu et al., 2023d), CCbench (Contributors, 2023),
Tiny LVLM (Shao et al., 2023) and BenchLLM (Cai et al.,
2023a), InfiMM-Eval (Han et al., 2023b), Qbench (Cai et al.,
2023b) for multi-modal language models (MLLM) to pro-
vide a comprehensive assessment of its characteristics. The
results, presented in Table 1, showcase SPHINX-X’s state-
of-the-art performance across various multi-modal tasks,
including mathematical reasoning, complex scene under-
standing, low-level vision tasks, and visual quality assess-
ment, as well as resilience when facing illusions.
Visual Question Answering. The evaluation on general
6
Page 7:
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Table 1. Performance comparison with state-of-the-art methods on popular MLLM benchmarks.
Methods POPE MMEPMMECMMB SEED LLaV AWMM-Vet CCbench MathVista Tiny LVLM BenchLMM InfiMM-Eval Qbench
BLIP-2 85.3 1293.8 - - 46.4 38.1 22.4 - - 284.7 - - -
InstructBLIP-7B - - - 36.0 53.4 60.9 26.2 12.1 25.3 300.6 44.63 - 56.7
InstructBLIP-13B 78.9 1212.8 - - - 58.2 25.6 - - - 45.03 - -
LLaMA-AdapterV2 - 1328.4 356.4 - - - - - - 229.2 - 30.5 59.5
Qwen-VL-7B - - - 38.2 56.3 - - 5.5 - - - - 59.4
Qwen-VL-7B-Chat - 1487.6 360.7 60.6 58.2 - - 39.3 - 316.8 - 37.4 -
LLaV A1.5-7B 85.9 1510.7 - 64.3 58.6 63.4 30.5 16.4 - - 46.8 - 58.7
LLaV A1.5-13B 85.9 1531.3 295.4 67.7 61.6 70.7 35.4 26.5 - 307.2 55.5 32.62 62.1
SPHINX 90.8 1560.2 310.0 67.1 71.6 74.3 36.6 27.9 27.5 288.9 - 30.7 65.8
SPHINX-Tiny 82.2 1261.2 242.1 56.6 17.1 52.3 23.8 17.5 26.4 301.5 50.0 21.9 19.7
SPHINX-Intern2 86.9 1260.4 294.6 57.9 68.8 57.6 36.5 21.0 35.5 312.9 47.0 31.5 60.0
SPHINX-Plus 89.1 1457.7 283.6 71.0 74.8 71.7 47.9 25.6 36.8 282.1 57.4 39.5 68.6
SPHINX-MoE 89.6 1485.3 367.1 71.3 73.0 70.2 40.9 15.4 42.7 335.3 50.7 38.6 66.2
Table 2. Performance on 7 academic VQA benchmarks.
Method OKVQA VQA V2 VizWiz GQA SQA IVQA
BLIP-2 45.9 - 19.6 41.0 - 40.6
InstructBLIP - - 33.4 49.5 - 44.8
LLaMA-AdapterV2 49.6 70.7 39.8 45.1 - -
Shikra 47.2 77.4 - - - -
Fuyu-8B 60.6 74.2 - - - -
MiniGPT-v2 57.8 - 53.6 60.1 - 51.5
Qwen-VL-7B 58.6 79.5 35.2 59.3 67.1 -
Qwen-VL-7B-Chat 56.6 78.2 38.9 57.5 68.2 -
LLaV A1.5-7B - 78.5 50.0 62.0 66.8 -
LLaV A1.5-13B - 80.0 53.6 63.3 71.6 -
SPHINX 62.2 80.2 46.8 62.9 69.1 52.7
SPHINX-Tiny 53.6 74.7 49.2 58.0 21.5 40.7
SPHINX-Intern2 55.5 75.5 49.6 56.2 70.4 49.0
SPHINX-Plus - - 57.8 - 74.2 54.7
SPHINX-MoE 62.7 81.1 61.9 63.8 74.5 57.3
visual question answering (VQA) benchmarks such as
VQA V2 (Agrawal et al., 2015), GQV (Hudson & Man-
ning, 2019), OK-VQA (Marino et al., 2019), VizWiz (Gurari
et al., 2018), ScienceQA (Lu et al., 2022), IconQA (Lu et al.,
2021b) are presented in Table 2. SPHINX-X excels across
diverse visual question-answering benchmarks, showcasing
its state-of-the-art performance in general visual understand-
ing, relational reasoning, scientific contexts, and symbolic
visual reasoning. Moreover, we conduct experiments on text-
oriented VQA benchmarks such as TextVQA (Singh et al.,
2019), OCRVQA (Mishra et al., 2019), DocVQA (Mathew
et al., 2021b), ChartQA (Masry et al., 2022), AI2D (Kem-
bhavi et al., 2016), DeepForm (Svetlichnaya, 2020), In-
foVQA (Mathew et al., 2021a), TabFact (Chen et al.,
2019), VisualMRC (Tanaka et al., 2021). As shown in
Table 4, SPHINX-X achieves competitive performance on
text-related benchmarks with a limited portion of OCR data.
Visual grounding. To evaluate SPHINX-X’s ability to
precisely locate and comprehend referred objects or regions
within images, we conduct experiments on Referring Ex-
pression Comprehension (REC) benchmarks, including Re-
fCOCO (Kazemzadeh et al., 2014), RefCOCO+ (Mao et al.,
2015), and RefCOCOg (Mao et al., 2015). The resultsare presented in Table 5, SPHINX-X consistently outper-
forms the majority of state-of-the-art models, surpassing
even specialist model G-DINO-L (Liu et al., 2023c) and
other visual-language generalist models.
4.3. SPHINX-MoE on other MLLM Benchmarks
To investigate the ability of SPHINX-MoE more concretely
and locate its ability level among many developed MLLMs,
we evaluate SPHINX-MoE on some recently curated bench-
marks, which are listed below:
-MathVerse (Zhang et al., 2024d): A mathematical bench-
mark in visual contexts to explore the multi-modal dia-
gram interpretation and reasoning capabilities of MLLMs,
which annotate math problems into different versions for
fine-grained evaluation.
-SciVerse (Guo et al., 2024): A comprehensive scientific prob-
lem benchmark (physics, chemistry, and biology) in visual
contexts to reveal the domain-specific knowledge compre-
hension proficiency of MLLMs.
-MMVP (Tong et al., 2024): A benchmark specially crafted
to measure MLLMs’ visual understanding capability.
-HallusionBench (Guan et al., 2023): A benchmark to agnos-
tic MLLMs’ language hallucination and visual illusion.
-AesBench (Huang et al., 2024): An expert benchmark aim-
ing to comprehensively evaluate the aesthetic perception
capacities of MLLMs.
-MMMU (Yue et al., 2023a) & CMMMU (Ge et al., 2024):
An English and a Chinese benchmark, respectively, aiming
to solve massive multi-discipline tasks, which need college-
level subject knowledge and deliberate reasoning ability.
-ScreenSpot (Cheng et al., 2024): A benchmark across vari-
ous GUI platforms and designed to assess MLLM’s capability
to localize elements based on the human’s instructions.
The results for MathVerse and SciVerse are showcased
in Table 3. Our SPHINX-MoE attains the best perfor-
mance among open-source models, indicating superior math
problem-solving and scientific understanding capabilities.
The results of our model on other benchmarks are included
in Table 6. As we can see SPHINX-MoE performs well
on all benchmarks, so we can infer that (i) SPHINX-MoE
7
Page 8:
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Table 3. Evaluation results of SPHINX-MoE on MathVerse (Zhang et al., 2024d) and SciVerse (Guo et al., 2024).
ModelMathVerse SciVerse
AllText
DominantText
LiteText
OnlyVision
IntensiveVision
DominantVision
OnlyAllText
OnlyKnowledge
LiteKnowledge
RichKnowledge
ProfessionalVision
DominantVision
Only
Open Source MLLM
LLaMA-Adapter V2 5.8 7.8 6.3 3.9 6.2 4.5 4.4 9.3 9.4 10.7 11.2 11.9 11.8 10.4
ImageBind-LLM 10.0 13.2 11.6 12.9 9.8 11.8 3.5 27.2 23.6 27.8 28.1 28.2 28.9 23.2
mPLUG-Owl2 10.3 11.6 11.4 13.8 11.1 9.4 8.0 - - - - - - -
MiniGPT-v2 10.9 13.2 12.7 15.3 11.1 11.3 6.4 30.0 28.5 31.4 29.6 30.5 31.6 26.9
LLaV A-NeXT 15.6 19.4 15.2 18.1 16.8 15.2 11.3 34.9 33.4 34.0 36.8 37.2 35.1 31.3
Closed Source MLLM
Qwen-VL-Plus 11.8 15.7 11.1 14.5 9.0 13.0 10.0 - - - - - - -
Gemini-Pro 24.1 26.3 23.5 27.3 23.0 22.3 22.2 - - - - - - -
Qwen-VL-Max 25.9 30.7 26.1 28.9 24.1 24.1 21.4 - - - - - - -
GPT-4V 41.0 54.7 41.4 48.7 34.9 34.4 31.6 - - - - - - -
SPHINX-MoE 15.6 22.2 16.4 18.3 14.8 12.6 9.1 37.3 41.1 38.9 38.8 41.3 36.3 31.4
Table 4. Performance on text-oriented VQA tasks. ‘†’ denotes
to use ground-truth OCR tokens during inference and training.
MethodText
VQAOCR
VQADoc
VQAChart
QAAI
2DDeep
FormInfo
VQAKLC WTQTab
FactVisual
MRC
Specialist models
Donut 43.5 - 67.5 41.8 - 61.6 11.6 30.0 18.8 54.6 93.9
UReader 57.6 - 65.4 59.3 - 49.5 42.2 32.8 29.4 67.6 221.7
Generalist models
BLIP-2 42.5†40.6 - - - - - - - - -
InstructBLIP 50.7†44.8 - - - - - - - - -
LLaMA-AdapterV2 37.4 - - - - - - - - - -
Qwen-VL-7B 63.8 75.7 65.1 65.7 62.3 - - - - - -
Qwen-VL-7B-Chat 61.5 70.5 62.6 62.6 62.6 - - - - - -
LLaV A1.5-7B 58.2 - - - - - - - - - -
LLaV A1.5-13B 61.3 - - - - - - - - - -
SPHINX 58.8 70.0 35.8 22.5 38.1 0 24.0 0 13.8 52.9 95.3
SPHINX-Tiny 57.8 60.3 53.0 34.1 24.6 11.8 26.3 22.2 15.3 51.1 147.5
SPHINX-Intern2 58.1 53.0 56.3 39.7 63.0 6.5 31.6 10.5 21.1 51.4 149.3
SPHINX-Plus 65.7 70.1 61.2 53.4 46.0 9.2 34.7 23.9 27.1 52.8 171.0
SPHINX-Plus-2K 70.6 68.9 71.6 55.1 47.4 23.2 39.1 31.1 31.1 54.0 178.4
SPHINX-MoE 68.0 64.8 68.4 55.0 55.6 20.7 41.8 25.5 29.9 52.7 184.4
has a better visual understanding ability and less language
hallucination than other competitors. (ii) SPHINX-MoE
can deal with the web and mobile domain data well. It
should be noted that on some tasks or metrics, SPHINX-
MoE performs even better than GPT-4V , e.g., MMVP and
AesP, AesE in AesBench. However, it is hard for SPHINX-
MoE to solve the multi-discipline tasks, i.e., the MMMU
and CMMMU benchmark. And we think this is due to
the lack of multi-modal multi-disciplinary data during the
training stage. Thus we would consider to involving more
multi-disciplinary data in SPHINX-MoE’s training.
4.4. Performance of SPHINX-Plus on Video Analysis
To further assess the visual comprehension capabilities of
our method, we conduct additional experiments on chal-
lenging video tasks. Since SPHINX-Plus is an image-based
MLLM and is not trained on any video data, we need toTable 5. Performance (Top-1 Accuracy@0.5) on Referring Ex-
pression Comprehension (REC) tasks.
MethodRefCOCO+ RefCOCO RefCOCOg
val test-A test-B val test-A test-B val-u test-u
Specialist models
UNINEXT 85.24 89.63 79.79 92.64 94.33 91.46 88.73 89.37
G-DINO-L 82.75 88.95 75.92 90.56 93.19 88.24 86.13 87.02
Generalist models
OFA-L 68.29 76.00 61.75 79.96 83.67 76.39 67.57 67.58
Shikra 13B 82.89 87.79 74.41 87.83 91.11 81.81 82.64 83.16
MiniGPT-v2-7B 79.97 85.12 74.45 88.69 91.65 85.33 84.44 84.66
MiniGPT-v2-7B
-Chat79.58 85.52 73.32 88.06 91.29 84.30 84.19 84.31
Qwen-VL-7B 83.12 88.25 77.21 89.36 92.26 85.34 85.58 85.48
Qwen-VL-7B
-Chat82.82 88.59 76.79 88.55 92.27 84.51 85.96 86.32
SPHINX 86.64 91.08 80.35 91.05 92.65 86.56 88.19 88.35
SPHINX-Tiny 71.34 78.49 63.71 82.89 86.89 77.91 78.50 78.86
SPHINX-Intern2 76.80 84.86 69.01 86.08 89.70 81.78 83.99 83.40
SPHINX-Plus 87.59 92.08 82.96 92.44 94.22 90.06 90.11 90.56
SPHINX-MoE 85.50 90.48 79.88 90.64 93.74 86.85 88.26 88.51
do additional processing on video inputs. To be specific,
we evenly sampled videos and selected the middle frame
as the representative frame for input into the model. We
conduct extensive experiments on Video-Bench (Ning et al.,
2023) which evaluates the performance of models across
three distinct capability levels: (i) video-exclusive under-
standing, (ii) prior knowledge-based question-answering,
and (iii) comprehension and decision-making.
As shown in Table 7, SPHINX-Plus, despite being an image-
based model, significantly outperforms existing models (Jin
et al., 2023; Su et al., 2023; Zhang et al., 2023a) specifi-
cally tailored for video tasks. Especially in the aspects of
video-exclusive understanding and prior knowledge-based
question-answering, SPHINX-Plus showcases outstanding
performance, signifying its proficiency in visual percep-
tion and knowledge extraction capabilities. In challenging
datasets such as MOT, SPHINX-Plus demonstrates slightly
8
Page 9:
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Table 6. Evaluation results of SPHINX-MoE on other MLLM benchmarks.
Methods MMVPHallusionBench AesBenchMMMU-valCMMMU ScreenSpot
qAcc fAcc EasyaAcc HardaAcc aAcc AesP AesE AesA AesI val test Mobile Desktop Web
Open Source MLLM
LLaV A 6.0 - - - - - 62.43 64.68 45.96 1.125 - - - - - -
MiniGPT-4 12.7 8.79 10.12 31.87 27.67 35.78 41.93 39.35 38.57 0.999 26.8 - - 6.4 3.7 2.8
InstructBLIP 16.7 9.45 10.11 35.60 45.12 45.26 54.29 53.89 46.54 1.126 32.9 - - - - -
LLaV A-v1.5 24.7 10.55 24.86 49.67 29.77 46.94 66.32 68.32 45.46 1.157 36.4 - - - - -
CogAgent-Chat - - - - - - - - - - - 24.6 23.6 46.6 46.5 49.7
Qwen-VL-7B-Chat - 5.93 6.65 31.43 24.88 39.15 63.21 64.18 46.25 1.192 35.9 30.7 31.3 6.9 5.9 0
Closed Source MLLM
Gemini-Pro 40.7 - - - - - 71.99 71.37 49.38 1.222 47.9 - - - - -
GPT-4V 38.7 28.79 39.88 75.60 37.67 65.28 72.08 70.16 50.86 1.301 56.8 42.5 43.7 - - -
SPHINX-MoE 49.3 16.48 23.12 55.16 37.91 52.08 72.93 73.32 49.93 1.267 31.1 29.3 29.6 55.1 50.5 37.3
Table 7. Comparison with state-of-the-art methods on Video-Bench. ‘*’ denotes the QA-pairs are re-constructed or annotated by
Video-Bench. ‘V’ denotes a video training version of the model used.
Methods Avg.Video-Exclusive Understanding Prior Knowledge-based QA Comprehension and Decision-Making
Activitynet-QA MSVD-QA*MSRVTT-QA*TGIF-QA YouCook2*UCF-Cirme*MOT*TV-QA*MV-QA*NBA-QA*License Exam*Decision-Making*SQA3D*
Video-based MLLM
Video-LLaMA 31.8 39.9 41.2 34.1 31.3 28.9 27.6 16.7 24.8 32.4 26.2 30.6 49.1 31.2
mPLUG-OwlV32.7 41.5 42.5 36.3 31.7 27.1 22.8 27.8 24.0 30.2 25.1 33.3 51.0 32.0
VideoChat 34.6 44.6 42.2 37.4 33.7 27.7 22.4 27.8 26.2 34.1 28.6 38.9 55.4 31.4
Chat-UniVi 35.2 49.0 48.6 41.7 41.3 29.0 28.3 16.7 23.1 33.6 25.7 38.9 53.1 29.1
PandaGPT 36.7 45.0 50.4 44.6 29.7 33.0 33.0 16.7 27.9 37.1 31.1 41.7 56.0 30.8
OtterV37.1 44.3 55.0 47.0 34.3 32.7 22.4 16.7 27.7 37.1 34.3 52.8 48.7 29.7
Video-ChatGPT 38.3 46.6 57.5 46.3 35.6 34.8 24.1 27.8 28.8 36.5 22.5 41.7 58.2 37.2
Image-based MLLM
SPHINX 39.0 50.1 56.7 45.4 42.8 37.0 25.2 5.6 29.8 33.3 30.9 50.0 52.8 47.7
SPHINX-Plus 45.1 53.1 68.5 54.0 53.4 42.0 27.6 11.1 36.5 44.0 45.0 47.2 55.6 48.8
lower performance compared to existing state-of-the-art
methods (Maaz et al., 2023; Li et al., 2023d). We attribute
this to the need to model timing relationships in videos.
SPHINX-Plus has not been fine-tuned by any video data, so
its performance marginally underperforms others.
4.5. Demonstrations of SPHINX-X
In Figure 2, the demonstrates of SPHINX-X indicate that
our models can 1) conduct fine-grained object detection in
high-resolution images by the proposed sub-image division
strategy; 2) conduct text spotting with accurate content and
positions; 3) engage in bilingual image-based conversations,
and generate coherent, accurate, and detailed Chinese de-
scriptions for synthetic images; 4) generate accurate code
for visual programming based on the precise understanding
of the given plot; 5) analyze App screenshots based on the
functional description and output the corresponding bound-
ing box; 6) accurately interpret geometric questions from
images, thanks to the math and extensive OCR datasets in-
cluded in our training corpus; and 7) estimate the correct
pose with rigorous body key points.
In the appendix (Figure 7), we respectively show the Set-of-
marks (SoM) prompting and OCR understanding capabili-
ties of SPHINX-X. With our curated SoM dataset, SPHINX-
X can well understand the marks given in the prompt, i.e., acat and bear bottle, and analyze the appearance and relations
of designated objects. By the training of OCR-intensive data,
our model can conduct accurate document layout detection
and character recognition.
5. Conclusion
In this paper, we introduce SPHINX-X, a series of MLLMs
for multi-purpose multi-modal instruction tuning with LLM
parameters ranging from 1B to 8 ×7B. On top of the orig-
inal SPHINX, we propose three aspects of improvements,
i.e., removing redundant visual encoders, bypassing fully-
padded sub-images with skip tokens, and simplifying multi-
stage training into a one-stage all-in-one paradigm. We
also curate a large-scale multi-domain dataset for MLLM
training, which contains a wide range of public datasets
and our constructed targeted data. Extensive benchmarks
and evaluations demonstrate the superior performance and
generalization capacity of SPHINX-X. We hope our work
may cast a light on future MLLM research.
9
Page 10:
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Acknowledgements
This project is funded in part by National Key R&D Pro-
gram of China Project 2022ZD0161100, by the Centre for
Perceptual and Interactive Intelligence (CPII) Ltd. under the
Innovation and Technology Commission (ITC)’s InnoHK,
by General Research Fund of Hong Kong RGC Project
14204021, and by National Natural Science Foundation of
China (Grant No. 62206272). Hongsheng Li is a PI of CPII
under the InnoHK.
Impact Statement
The SPHINX-X Multi-modality Large Language Model
series has the potential to impact society in several ways:
Enhanced Multimodal AI Applications : SPHINX-X
could lead to the development of more sophisticated AI sys-
tems capable of understanding and interacting with both text
and visual input. This can improve services like automated
translations, image recognition, and assistive technologies
for individuals with disabilities.
Inclusivity and Language Diversity : By training on a di-
verse, multi-domain, and multi-modal dataset, the model
may offer broader language support, which can bridge com-
munication gaps and foster inclusivity.
Ethical and Bias Considerations : The expansive dataset
used for training must be carefully curated to avoid perpet-
uating biases, ensuring that the model’s responses are fair
and ethical.
Misuse Risks : The misuse of MLLMs for generating de-
ceptive content is a risk, underscoring the need for robust
governance and ethical usage frameworks.
The responsible deployment of SPHINX-X requires care-
ful consideration of these potential impacts to maximize
benefits and minimize negative consequences.
References
Agrawal, A., Lu, J., Antol, S., Mitchell, M., Zitnick, C. L.,
Parikh, D., and Batra, D. Vqa: Visual question answering.
International Journal of Computer Vision , 123:4 – 31,
2015.
Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I.,
Hasson, Y ., Lenc, K., Mensch, A., Millican, K., Reynolds,
M., et al. Flamingo: a visual language model for few-shot
learning. Advances in Neural Information Processing
Systems , 35:23716–23736, 2022.
Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin,
J., Zhou, C., and Zhou, J. Qwen-vl: A versatile vision-
language model for understanding, localization, text read-
ing, and beyond. arXiv preprint arXiv:2308.12966 , 2023.Bao, H., Wang, W., Dong, L., Liu, Q., Mohammed, O. K.,
Aggarwal, K., Som, S., Piao, S., and Wei, F. Vlmo:
Unified vision-language pre-training with mixture-of-
modality-experts. In Koyejo, S., Mohamed, S., Agarwal,
A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in
Neural Information Processing , 2022.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D.,
Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,
Askell, A., et al. Language models are few-shot learners.
InAdvances in neural information processing systems ,
pp. 1877–1901, 2020.
Cai, R., Song, Z., Guan, D., Chen, Z., Luo, X., Yi, C.,
and Kot, A. Benchlmm: Benchmarking cross-style
visual capability of large multimodal models. ArXiv ,
abs/2312.02896, 2023a.
Cai, R., Song, Z., Guan, D., Chen, Z., Luo, X., Yi, C., and
Kot, A. Benchlmm: Benchmarking cross-style visual
capability of large multimodal models. arXiv preprint
arXiv:2312.02896 , 2023b.
Cao, Y ., Xu, X., Sun, C., Huang, X., and Shen, W. Towards
generic anomaly detection and understanding: Large-
scale visual-linguistic model (gpt-4v) takes the lead.
arXiv preprint arXiv:2311.02782 , 2023.
Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., and
Zhao, R. Shikra: Unleashing multimodal llm’s referential
dialogue magic. arXiv preprint arXiv:2306.15195 , 2023a.
Chen, L., Li, J., wen Dong, X., Zhang, P., He, C.,
Wang, J., Zhao, F., and Lin, D. Sharegpt4v: Improv-
ing large multi-modal models with better captions.
ArXiv , abs/2311.12793, 2023b. URL https:
//api.semanticscholar.org/CorpusID:
265308687 .
Chen, W., Wang, H., Chen, J., Zhang, Y ., Wang, H., LI, S.,
Zhou, X., and Wang, W. Y . Tabfact: A large-scale dataset
for table-based fact verification. ArXiv , abs/1909.02164,
2019.
Cheng, K., Sun, Q., Chu, Y ., Xu, F., Li, Y ., Zhang, J., and
Wu, Z. Seeclick: Harnessing gui grounding for advanced
visual gui agents. arXiv preprint arXiv:2401.10935 , 2024.
Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y ., Wu, Z.,
Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y .,
Gonzalez, J. E., Stoica, I., and Xing, E. P. Vi-
cuna: An open-source chatbot impressing gpt-4 with
90%* chatgpt quality. https://lmsys.org/blog/
2023-03-30-vicuna/ , March 2023.
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra,
G., Roberts, A., Barham, P., Chung, H. W., Sutton, C.,
Gehrmann, S., et al. Palm: Scaling language modeling
with pathways. arXiv preprint arXiv:2204.02311 , 2022.
10
Page 11:
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Contributors, O. Opencompass: A universal evaluation
platform for foundation models. https://github.
com/open-compass/opencompass , 2023.
Dai, W., Li, J., Li, D., Tiong, A. M. H., Zhao, J., Wang,
W., Li, B., Fung, P., and Hoi, S. Instructblip: Towards
general-purpose vision-language models with instruction
tuning, 2023.
Ding, N., Chen, Y ., Xu, B., Qin, Y ., Zheng, Z., Hu, S., Liu,
Z., Sun, M., and Zhou, B. Enhancing chat language mod-
els by scaling high-quality instructional conversations.
arXiv preprint arXiv:2305.14233 , 2023.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer,
M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N.
An image is worth 16x16 words: Transformers for image
recognition at scale. ArXiv , abs/2010.11929, 2020.
Fedus, W., Zoph, B., and Shazeer, N. Switch transform-
ers: Scaling to trillion parameter models with simple
and efficient sparsity. The Journal of Machine Learning
Research , 23(1):5232–5270, 2022.
Fu, C., Chen, P., Shen, Y ., Qin, Y ., Zhang, M., Lin, X., Qiu,
Z., Lin, W., Yang, J., Zheng, X., et al. Mme: A compre-
hensive evaluation benchmark for multimodal large lan-
guage models. arXiv preprint arXiv:2306.13394 , 2023a.
Fu, C., Chen, P., Shen, Y ., Qin, Y ., Zhang, M., Lin, X., Yang,
J., Zheng, X., Li, K., Sun, X., Wu, Y ., and Ji, R. Mme:
A comprehensive evaluation benchmark for multimodal
large language models. arXiv preprint arXiv:2306.13394 ,
2023b.
Fu, C., Zhang, R., Lin, H., Wang, Z., Gao, T., Luo, Y .,
Huang, Y ., Zhang, Z., Qiu, L., Ye, G., et al. A challenger
to gpt-4v? early explorations of gemini in visual expertise.
arXiv preprint arXiv:2312.12436 , 2023c.
Gao, P., Han, J., Zhang, R., Lin, Z., Geng, S., Zhou, A.,
Zhang, W., Lu, P., He, C., Yue, X., Li, H., and Qiao, Y .
Llama-adapter v2: Parameter-efficient visual instruction
model. arXiv preprint arXiv:2304.15010 , 2023.
Ge, Z., Xinrun, D., Bei, C., Yiming, L., Tongxu, L.,
Tianyu, Z., Kang, Z., Yuyang, C., Chunpu, X., Shuyue,
G., Haoran, Z., Xingwei, Q., Junjie, W., Ruibin, Y .,
Yizhi, L., Zekun, W., Yudong, L., Yu-Hsuan, T., Fengji,
Z., Chenghua, L., Wenhao, H., Wenhu, C., and Jie,
F. Cmmmu: A chinese massive multi-discipline mul-
timodal understanding benchmark. arXiv preprint
arXiv:2401.20847 , 2024.
Gemini Team, G. Gemini: a family of highly capable multi-
modal models. arXiv preprint arXiv:2312.11805 , 2023.Geng, H., Wei, S., Deng, C., Shen, B., Wang, H., and
Guibas, L. Sage: Bridging semantic and actionable parts
for generalizable articulated-object manipulation under
language instructions. arXiv preprint arXiv:2312.01307 ,
2023.
Ghosal, D., Chia, Y . K., Majumder, N., and Poria, S. Fla-
cuna: Unleashing the problem solving power of vicuna
using flan fine-tuning, 2023.
Guan, T., Liu, F., Li, X. W. R. X. Z., Wang, X. L. X., Yacoob,
L. C. F. H. Y ., and Zhou, D. M. T. Hallusionbench: An
advanced diagnostic suite for entangled language halluci-
nation & visual illusion in large vision-language models.
arXiv e-prints , pp. arXiv–2310, 2023.
Guo, Z., Zhang, R., Zhu, X., Tang, Y ., Ma, X., Han, J.,
Chen, K., Gao, P., Li, X., Li, H., et al. Point-bind &
point-llm: Aligning point cloud with multi-modality for
3d understanding, generation, and instruction following.
arXiv preprint arXiv:2309.00615 , 2023.
Guo, Z., Zhang, R., Chen, H., Gao, J., Gao, P., Li, H.,
and Heng, P.-A. Sciverse. arXiv preprint , 2024. URL
https://sciverse-cuhk.github.io/ .
Gupta, A., Doll ´ar, P., and Girshick, R. B. Lvis: A dataset for
large vocabulary instance segmentation. 2019 IEEE/CVF
Conference on Computer Vision and Pattern Recognition
(CVPR) , pp. 5351–5359, 2019.
Gurari, D., Li, Q., Stangl, A., Guo, A., Lin, C., Grauman,
K., Luo, J., and Bigham, J. P. Vizwiz grand challenge:
Answering visual questions from blind people. 2018
IEEE/CVF Conference on Computer Vision and Pattern
Recognition , pp. 3608–3617, 2018.
Han, J., Zhang, R., Shao, W., Gao, P., Xu, P., Xiao, H.,
Zhang, K., Liu, C., Wen, S., Guo, Z., et al. Imagebind-
llm: Multi-modality instruction tuning. arXiv preprint
arXiv:2309.03905 , 2023a.
Han, X., You, Q., Liu, Y ., Chen, W., Zheng, H., Mrini, K.,
Lin, X., Wang, Y ., Zhai, B., Yuan, J., Wang, H., and
Yang, H. Infimm-eval: Complex open-ended reasoning
evaluation for multi-modal large language models. 2023b.
He, H., Yao, W., Ma, K., Yu, W., Dai, Y ., Zhang, H., Lan,
Z., and Yu, D. Webvoyager: Building an end-to-end
web agent with large multimodal models. arXiv preprint
arXiv:2401.13919 , 2024.
Hu, A., Shi, Y ., Xu, H., Ye, J., Ye, Q., Yan, M., Li, C.,
Qian, Q., Zhang, J., and Huang, F. mplug-paperowl:
Scientific diagram analysis with the multimodal large
language model. arXiv preprint arXiv:2311.18248 , 2023.
11
Page 12:
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Huang, Y ., Yuan, Q., Sheng, X., Yang, Z., Wu, H., Chen, P.,
Yang, Y ., Li, L., and Lin, W. Aesbench: An expert bench-
mark for multimodal large language models on image
aesthetics perception. arXiv preprint arXiv:2401.08276 ,
2024.
Hudson, D. A. and Manning, C. D. Gqa: A new dataset for
real-world visual reasoning and compositional question
answering. 2019 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR) , pp. 6693–6702,
2019.
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C.,
Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G.,
Lample, G., Saulnier, L., et al. Mistral 7b. arXiv preprint
arXiv:2310.06825 , 2023.
Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary,
B., Bamford, C., Chaplot, D. S., Casas, D. d. l., Hanna,
E. B., Bressand, F., et al. Mixtral of experts. arXiv
preprint arXiv:2401.04088 , 2024a.
Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary,
B., Bamford, C., Chaplot, D. S., de Las Casas, D., Hanna,
E. B., Bressand, F., Lengyel, G., Bour, G., Lample, G.,
Lavaud, L. R., Saulnier, L., Lachaux, M., Stock, P., Sub-
ramanian, S., Yang, S., Antoniak, S., Scao, T. L., Gervet,
T., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E.
Mixtral of experts. Arxiv 2401.04088 , 2024b.
Jin, P., Takanobu, R., Zhang, C., Cao, X., and Yuan, L.
Chat-univi: Unified visual representation empowers large
language models with image and video understanding.
arXiv preprint arXiv:2311.08046 , 2023.
Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L.,
Zitnick, C. L., and Girshick, R. B. Clevr: A diagnostic
dataset for compositional language and elementary visual
reasoning. 2017 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR) , pp. 1988–1997, 2016.
Kafle, K., Cohen, S. D., Price, B. L., and Kanan, C.
Dvqa: Understanding data visualizations via question
answering. 2018 IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition , pp. 5648–5656,
2018. URL https://api.semanticscholar.
org/CorpusID:4445015 .
Kazemzadeh, S., Ordonez, V ., andre Matten, M., and Berg,
T. L. Referitgame: Referring to objects in photographs
of natural scenes. In Conference on Empirical Methods
in Natural Language Processing , 2014.
Kembhavi, A., Salvato, M., Kolve, E., Seo, M., Hajishirzi,
H., and Farhadi, A. A diagram is worth a dozen images.
ArXiv , abs/1603.07396, 2016. URL https://api.
semanticscholar.org/CorpusID:2682274 .Kim, G., Hong, T., Yim, M., Park, J., Yim, J., Hwang,
W., Yun, S., Han, D., and Park, S. Donut: Document
understanding transformer without ocr. arXiv preprint
arXiv:2111.15664 , 7:15, 2021.
Krishna, R., Zhu, Y ., Groth, O., Johnson, J., Hata, K.,
Kravitz, J., Chen, S., Kalantidis, Y ., Li, L.-J., Shamma,
D. A., et al. Visual genome: Connecting language and
vision using crowdsourced dense image annotations. In-
ternational journal of computer vision , 123:32–73, 2017.
Kuznetsova, A., Rom, H., Alldrin, N. G., Uijlings, J. R. R.,
Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci,
M., Kolesnikov, A., Duerig, T., and Ferrari, V . The open
images dataset v4. International Journal of Computer
Vision , 128:1956 – 1981, 2018.
Lepikhin, D., Lee, H., Xu, Y ., Chen, D., Firat, O., Huang, Y .,
Krikun, M., Shazeer, N., and Chen, Z. Gshard: Scaling
giant models with conditional computation and automatic
sharding. arXiv preprint arXiv:2006.16668 , 2020.
Li, B., Wang, R., Wang, G., Ge, Y ., Ge, Y ., and Shan, Y .
Seed-bench: Benchmarking multimodal llms with gener-
ative comprehension. ArXiv , abs/2307.16125, 2023a.
Li, B., Zhang, K., Zhang, H., Guo, D., Zhang, R.,
Li, F., Zhang, Y ., Liu, Z., and Li, C. Llava-
next: Stronger llms supercharge multimodal
capabilities in the wild, May 2024a. URL
https://llava-vl.github.io/blog/
2024-05-10-llava-next-stronger-llms/ .
Li, F., Zhang, R., Zhang, H., Zhang, Y ., Li, B., Li, W.,
Ma, Z., and Li, C. Llava-next: Tackling multi-image,
video, and 3d in large multimodal models, June 2024b.
URL https://llava-vl.github.io/blog/
2024-06-16-llava-next-interleave/ .
Li, J., Li, D., Xiong, C., and Hoi, S. Blip: Bootstrapping
language-image pre-training for unified vision-language
understanding and generation. In International Confer-
ence on Machine Learning , pp. 12888–12900. PMLR,
2022.
Li, J., Li, D., Savarese, S., and Hoi, S. BLIP-2: Boot-
strapping language-image pre-training with frozen image
encoders and large language models. In Krause, A., Brun-
skill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett,
J. (eds.), Proceedings of the 40th International Confer-
ence on Machine Learning , volume 202 of Proceedings
of Machine Learning Research , pp. 19730–19742. PMLR,
23–29 Jul 2023b. URL https://proceedings.
mlr.press/v202/li23q.html .
Li, J., Li, D., Savarese, S., and Hoi, S. C. H. Blip-
2: Bootstrapping language-image pre-training with
12
Page 13:
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
frozen image encoders and large language models.
InInternational Conference on Machine Learning ,
2023c. URL https://api.semanticscholar.
org/CorpusID:256390509 .
Li, K., He, Y ., Wang, Y ., Li, Y ., Wang, W., Luo, P., Wang,
Y ., Wang, L., and Qiao, Y . Videochat: Chat-centric video
understanding. arXiv preprint arXiv:2305.06355 , 2023d.
Li, K., Wang, Y ., He, Y ., Li, Y ., Wang, Y ., Liu, Y ., Wang, Z.,
Xu, J., Chen, G., Luo, P., et al. Mvbench: A comprehen-
sive multi-modal video understanding benchmark. arXiv
preprint arXiv:2311.17005 , 2023e.
Li, Y ., Du, Y ., Zhou, K., Wang, J., Zhao, W. X., and Wen,
J.-R. Evaluating object hallucination in large vision-
language models. arXiv preprint arXiv:2305.10355 ,
2023f.
Lian, W., Goodson, B., Pentland, E., Cook, A., V ong, C.,
and ”Teknium”. Openorca: An open dataset of gpt
augmented flan reasoning traces. https://https:
//huggingface.co/Open-Orca/OpenOrca ,
2023.
Lin, T.-Y ., Maire, M., Belongie, S. J., Hays, J., Perona, P.,
Ramanan, D., Doll ´ar, P., and Zitnick, C. L. Microsoft
coco: Common objects in context. In European Confer-
ence on Computer Vision , 2014.
Lin, Z., Liu, C., Zhang, R., Gao, P., Qiu, L., Xiao, H., Qiu,
H., Lin, C., Shao, W., Chen, K., et al. Sphinx: The
joint mixing of weights, tasks, and visual embeddings
for multi-modal large language models. arXiv preprint
arXiv:2311.07575 , 2023.
Liu, H., Li, C., Li, Y ., and Lee, Y . J. Improved baselines
with visual instruction tuning, 2023a.
Liu, H., Li, C., Wu, Q., and Lee, Y . J. Visual instruction
tuning. In NeurIPS , 2023b.
Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., yue Li,
C., Yang, J., Su, H., Zhu, J.-J., and Zhang, L. Grounding
dino: Marrying dino with grounded pre-training for open-
set object detection. ArXiv , abs/2303.05499, 2023c.
Liu, Y ., Duan, H., Zhang, Y ., Li, B., Zhang, S., Zhao, W.,
Yuan, Y ., Wang, J., He, C., Liu, Z., et al. Mmbench: Is
your multi-modal model an all-around player? arXiv
preprint arXiv:2307.06281 , 2023d.
Liu, Z., Mao, H., Wu, C.-Y ., Feichtenhofer, C., Darrell, T.,
and Xie, S. A convnet for the 2020s. In Proceedings of
the IEEE/CVF conference on computer vision and pattern
recognition , pp. 11976–11986, 2022.Lu, P., Gong, R., Jiang, S., Qiu, L., Huang, S.,
Liang, X., and Zhu, S.-C. Inter-gps: Inter-
pretable geometry problem solving with formal lan-
guage and symbolic reasoning. In Annual Meet-
ing of the Association for Computational Linguistics ,
2021a. URL https://api.semanticscholar.
org/CorpusID:234337054 .
Lu, P., Qiu, L., Chen, J., Xia, T., Zhao, Y ., Zhang, W., Yu,
Z., Liang, X., and Zhu, S.-C. Iconqa: A new benchmark
for abstract diagram understanding and visual language
reasoning. ArXiv , abs/2110.13214, 2021b.
Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.-W., Zhu,
S.-C., Tafjord, O., Clark, P., and Kalyan, A. Learn to
explain: Multimodal reasoning via thought chains for sci-
ence question answering. ArXiv , abs/2209.09513, 2022.
Lu, P., Bansal, H., Xia, T., Liu, J., yue Li, C., Hajishirzi,
H., Cheng, H., Chang, K.-W., Galley, M., and Gao, J.
Mathvista: Evaluating math reasoning in visual contexts
with gpt-4v, bard, and other large multimodal models.
ArXiv , abs/2310.02255, 2023.
Luo, Z., Xu, C., Zhao, P., Sun, Q., Geng, X., Hu, W., Tao, C.,
Ma, J., Lin, Q., and Jiang, D. Wizardcoder: Empowering
code large language models with evol-instruct. arXiv
preprint arXiv:2306.08568 , 2023.
Lv, T., Huang, Y ., Chen, J., Cui, L., Ma, S., Chang,
Y ., Huang, S., Wang, W., Dong, L., Luo, W., et al.
Kosmos-2.5: A multimodal literate model. arXiv preprint
arXiv:2309.11419 , 2023.
Maaz, M., Rasheed, H., Khan, S., and Khan, F. S.
Video-chatgpt: Towards detailed video understanding
via large vision and language models. arXiv preprint
arXiv:2306.05424 , 2023.
Mao, J., Huang, J., Toshev, A., Camburu, O.-M., Yuille,
A. L., and Murphy, K. P. Generation and comprehen-
sion of unambiguous object descriptions. 2016 IEEE
Conference on Computer Vision and Pattern Recognition
(CVPR) , pp. 11–20, 2015.
Marino, K., Rastegari, M., Farhadi, A., and Mottaghi, R. Ok-
vqa: A visual question answering benchmark requiring
external knowledge. 2019 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR) , pp.
3190–3199, 2019.
Masry, A., Long, D., Tan, J. Q., Joty, S., and
Hoque, E. ChartQA: A benchmark for question
answering about charts with visual and logical rea-
soning. In Findings of the Association for Com-
putational Linguistics: ACL 2022 , pp. 2263–2279,
Dublin, Ireland, May 2022. Association for Computa-
tional Linguistics. doi: 10.18653/v1/2022.findings-acl.
13
Page 14:
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
177. URL https://aclanthology.org/2022.
findings-acl.177 .
Mathew, M., Bagal, V ., Tito, R. P., Karatzas, D., Valveny, E.,
and Jawahar, C. Infographicvqa. 2022 IEEE/CVF Winter
Conference on Applications of Computer Vision (WACV) ,
pp. 2582–2591, 2021a.
Mathew, M., Karatzas, D., and Jawahar, C. Docvqa: A
dataset for vqa on document images. In Proceedings
of the IEEE/CVF winter conference on applications of
computer vision , pp. 2200–2209, 2021b.
Meng, F., Shao, W., Lu, Q., Gao, P., Zhang, K., Qiao, Y .,
and Luo, P. Chartassisstant: A universal chart multimodal
language model via chart-to-table pre-training and multi-
task instruction tuning. arXiv preprint arXiv:2401.02384 ,
2024.
Microsoft. Phi-2, 2023. URL https://huggingface.
co/microsoft/phi-2 .
Mishra, A., Shekhar, S., Singh, A. K., and Chakraborty, A.
Ocr-vqa: Visual question answering by reading text in
images. 2019 International Conference on Document
Analysis and Recognition (ICDAR) , pp. 947–952, 2019.
Ning, M., Zhu, B., Xie, Y ., Lin, B., Cui, J., Yuan, L., Chen,
D., and Yuan, L. Video-bench: A comprehensive bench-
mark and toolkit for evaluating video-based large lan-
guage models. arXiv preprint arXiv:2311.16103 , 2023.
OpenAI. GPT-4V(ision) system card, 2023.
URL https://openai.com/research/
gpt-4v-system-card .
Oquab, M., Darcet, T., Moutakanni, T., V o, H., Szafraniec,
M., Khalidov, V ., Fernandez, P., Haziza, D., Massa, F., El-
Nouby, A., et al. Dinov2: Learning robust visual features
without supervision. arXiv preprint arXiv:2304.07193 ,
2023.
Ordonez, V ., Kulkarni, G., and Berg, T. Im2text: Describing
images using 1 million captioned photographs. In
Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., and
Weinberger, K. (eds.), Advances in Neural Information
Processing Systems , volume 24. Curran Associates, Inc.,
2011. URL https://proceedings.neurips.
cc/paper_files/paper/2011/file/
5dd9db5e033da9c6fb5ba83c7a7ebea9-Paper.
pdf.
Pasupat, P. and Liang, P. Compositional semantic
parsing on semi-structured tables. In Annual Meet-
ing of the Association for Computational Linguistics ,
2015. URL https://api.semanticscholar.
org/CorpusID:9027681 .Radford, A., Wu, J., Child, R., Luan, D., Amodei, D.,
Sutskever, I., et al. Language models are unsupervised
multitask learners. OpenAI blog , 1(8):9, 2019.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh,
G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P.,
Clark, J., Krueger, G., and Sutskever, I. Learning trans-
ferable visual models from natural language supervi-
sion. In International Conference on Machine Learning ,
2021. URL https://api.semanticscholar.
org/CorpusID:231591445 .
Rajbhandari, S., Rasley, J., Ruwase, O., and He, Y . Zero:
Memory optimizations toward training trillion parameter
models. In SC20: International Conference for High Per-
formance Computing, Networking, Storage and Analysis ,
pp. 1–16. IEEE, 2020.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein,
M. S., Berg, A. C., and Fei-Fei, L. Imagenet large scale
visual recognition challenge. International Journal of
Computer Vision , 115:211 – 252, 2014.
Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk,
R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., and
Komatsuzaki, A. Laion-400m: Open dataset of clip-
filtered 400 million image-text pairs. arXiv preprint
arXiv:2111.02114 , 2021.
Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C.,
Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis,
C., Wortsman, M., et al. Laion-5b: An open large-scale
dataset for training next generation image-text models.
Advances in Neural Information Processing Systems , 35:
25278–25294, 2022.
Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X., Li,
J., and Sun, J. Objects365: A large-scale, high-quality
dataset for object detection. 2019 IEEE/CVF Interna-
tional Conference on Computer Vision (ICCV) , pp. 8429–
8438, 2019.
Shao, W., Hu, Y ., Gao, P., Lei, M., Zhang, K., Meng, F.,
Xu, P., Huang, S., Li, H., Qiao, Y ., et al. Tiny lvlm-ehub:
Early multimodal experiments with bard. arXiv preprint
arXiv:2308.03729 , 2023.
Sharma, P., Ding, N., Goodman, S., and Soricut, R. Con-
ceptual captions: A cleaned, hypernymed, image alt-text
dataset for automatic image captioning. In Proceedings
of the 56th Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers) , pp. 2556–
2565, 2018.
Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le,
Q., Hinton, G., and Dean, J. Outrageously large neural
14
Page 15:
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
networks: The sparsely-gated mixture-of-experts layer.
arXiv preprint arXiv:1701.06538 , 2017.
Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper,
J., and Catanzaro, B. Megatron-lm: Training multi-
billion parameter language models using model paral-
lelism. arXiv preprint arXiv:1909.08053 , 2019.
Singh, A., Natarajan, V ., Shah, M., Jiang, Y ., Chen, X.,
Batra, D., Parikh, D., and Rohrbach, M. Towards vqa
models that can read. 2019 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR) , pp.
8309–8318, 2019.
Stanislawek, T., Grali’nski, F., Wr’oblewska, A., Lipi’nski,
D., Kaliska, A., Rosalska, P., Topolski, B., and Biecek, P.
Kleister: Key information extraction datasets involving
long documents with complex layouts. In IEEE Interna-
tional Conference on Document Analysis and Recogni-
tion, 2021.
Su, Y ., Lan, T., Li, H., Xu, J., Wang, Y ., and Cai, D.
Pandagpt: One model to instruction-follow them all.
arXiv preprint arXiv:2305.16355 , 2023.
Svetlichnaya, S. Deepform, 2020. URL
https://wandb.ai/stacey/deepform_
v1/reports/DeepForm-Understand-/
Structured-Documents-at-Scale .
Tanaka, R., Nishida, K., and Yoshida, S. Visualmrc:
Machine reading comprehension on document images.
ArXiv , abs/2101.11272, 2021.
Team, I. Internlm: A multilingual language model with
progressively enhanced capabilities, 2023.
Tong, S., Liu, Z., Zhai, Y ., Ma, Y ., LeCun, Y ., and Xie, S.
Eyes wide shut? exploring the visual shortcomings of
multimodal llms. Arxiv 2401.06209 , 2024.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux,
M.-A., Lacroix, T., Rozi `ere, B., Goyal, N., Hambro, E.,
Azhar, F., et al. Llama: Open and efficient foundation lan-
guage models. arXiv preprint arXiv:2302.13971 , 2023a.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi,
A., Babaei, Y ., Bashlykov, N., Batra, S., Bhargava, P.,
Bhosale, S., et al. Llama 2: Open foundation and fine-
tuned chat models. arXiv preprint arXiv:2307.09288 ,
2023b.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Atten-
tion is all you need. In Advances in neural information
processing systems , 2017.Wang, J., Meng, L., Weng, Z., He, B., Wu, Z., and Jiang,
Y .-G. To see is to believe: Prompting gpt-4v for bet-
ter visual instruction tuning. ArXiv , abs/2311.07574,
2023a. URL https://api.semanticscholar.
org/CorpusID:265150580 .
Wang, W., Chen, Z., Chen, X., Wu, J., Zhu, X., Zeng, G.,
Luo, P., Lu, T., Zhou, J., Qiao, Y ., et al. Visionllm: Large
language model is also an open-ended decoder for vision-
centric tasks. arXiv preprint arXiv:2305.11175 , 2023b.
Wen, L., Fu, D., Li, X., Cai, X., Ma, T., Cai, P., Dou, M.,
Shi, B., He, L., and Qiao, Y . Dilu: A knowledge-driven
approach to autonomous driving with large language mod-
els.arXiv preprint arXiv:2309.16292 , 2023.
Workshop, B., Scao, T. L., Fan, A., Akiki, C., Pavlick, E.,
Ili´c, S., Hesslow, D., Castagn ´e, R., Luccioni, A. S., Yvon,
F., et al. Bloom: A 176b-parameter open-access multilin-
gual language model. arXiv preprint arXiv:2211.05100 ,
2022.
Yang, A., Xiao, B., Wang, B., Zhang, B., Bian, C., Yin,
C., Lv, C., Pan, D., Wang, D., Yan, D., et al. Baichuan
2: Open large-scale language models. arXiv preprint
arXiv:2309.10305 , 2023a.
Yang, J., Zeng, A., Zhang, R., and Zhang, L. Unipose:
Detecting any keypoints. ArXiv , abs/2310.08530, 2023b.
Yang, J., Zhang, H., Li, F., Zou, X., Li, C., and Gao, J.
Set-of-mark prompting unleashes extraordinary visual
grounding in gpt-4v. arXiv preprint arXiv:2310.11441 ,
2023c.
Yang, S., Liu, J., Zhang, R., Pan, M., Guo, Z., Li, X.,
Chen, Z., Gao, P., Guo, Y ., and Zhang, S. Lidar-llm:
Exploring the potential of large language models for 3d
lidar understanding. arXiv preprint arXiv:2312.14074 ,
2023d.
Yang, Z., Li, L., Lin, K., Wang, J., Lin, C.-C., Liu, Z., and
Wang, L. The dawn of lmms: Preliminary explorations
with gpt-4v (ision). arXiv preprint arXiv:2309.17421 , 9
(1):1, 2023e.
Yang, Z., Liu, J., Han, Y ., Chen, X., Huang, Z., Fu, B.,
and Yu, G. Appagent: Multimodal agents as smartphone
users. arXiv preprint arXiv:2312.13771 , 2023f.
Ye, J., Hu, A., Xu, H., Ye, Q., Yan, M., Dan, Y ., Zhao,
C., Xu, G., Li, C., Tian, J., et al. mplug-docowl: Modu-
larized multimodal large language model for document
understanding. arXiv preprint arXiv:2307.02499 , 2023a.
Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y ., Wang, J.,
Hu, A., Shi, P., Shi, Y ., Jiang, C., Li, C., Xu, Y ., Chen, H.,
Tian, J., Qian, Q., Zhang, J., and Huang, F. mplug-owl:
15
Page 16:
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Modularization empowers large language models with
multimodality, 2023b.
Ye, Q., Xu, H., Ye, J., Yan, M., Hu, A., Liu, H., Qian,
Q., Zhang, J., Huang, F., and Zhou, J. mplug-owl2:
Revolutionizing multi-modal large language model with
modality collaboration, 2023c.
Yim, M., Kim, Y ., Cho, H.-C., and Park, S. Synthtiger:
Synthetic text image generator towards better text recog-
nition models. In International Conference on Document
Analysis and Recognition , pp. 109–124. Springer, 2021.
Yu, L., Jiang, W., Shi, H., Yu, J., Liu, Z., Zhang, Y ., Kwok,
J. T., Li, Z., Weller, A., and Liu, W. Metamath: Boot-
strap your own mathematical questions for large language
models. arXiv preprint arXiv:2309.12284 , 2023a.
Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang,
X., and Wang, L. Mm-vet: Evaluating large multimodal
models for integrated capabilities. ArXiv , abs/2308.02490,
2023b.
Yue, X., Ni, Y ., Zhang, K., Zheng, T., Liu, R., Zhang, G.,
Stevens, S., Jiang, D., Ren, W., Sun, Y ., Wei, C., Yu, B.,
Yuan, R., Sun, R., Yin, M., Zheng, B., Yang, Z., Liu,
Y ., Huang, W., Sun, H., Su, Y ., and Chen, W. Mmmu:
A massive multi-discipline multimodal understanding
and reasoning benchmark for expert agi. arXiv preprint
arXiv:2311.16502 , 2023a.
Yue, X., Qu, X., Zhang, G., Fu, Y ., Huang, W., Sun, H., Su,
Y ., and Chen, W. Mammoth: Building math generalist
models through hybrid instruction tuning. arXiv preprint
arXiv:2309.05653 , 2023b.
Zhang, H., Li, X., and Bing, L. Video-llama: An instruction-
tuned audio-visual language model for video understand-
ing.arXiv preprint arXiv:2306.02858 , 2023a.
Zhang, P., Zeng, G., Wang, T., and Lu, W. Tinyl-
lama: An open-source small language model.
ArXiv , abs/2401.02385, 2024a. URL https:
//api.semanticscholar.org/CorpusID:
266755802 .
Zhang, P., Zeng, G., Wang, T., and Lu, W. Tinyllama:
An open-source small language model. arXiv preprint
arXiv:2401.02385 , 2024b.
Zhang, R., Guo, Z., Zhang, W., Li, K., Miao, X., Cui, B.,
Qiao, Y ., Gao, P., and Li, H. Pointclip: Point cloud
understanding by clip. In CVPR 2022 , 2022a.
Zhang, R., Hu, X., Li, B., Huang, S., Deng, H., Li, H., Qiao,
Y ., and Gao, P. Prompt, generate, then cache: Cascade of
foundation models makes strong few-shot learners. CVPR
2023 , 2023b.Zhang, R., Han, J., Zhou, A., Hu, X., Yan, S., Lu, P.,
Li, H., Gao, P., and Qiao, Y . LLaMA-adapter: Effi-
cient fine-tuning of large language models with zero-
initialized attention. In The Twelfth International Confer-
ence on Learning Representations , 2024c. URL https:
//openreview.net/forum?id=d4UiXAHN2W .
Zhang, R., Jiang, D., Zhang, Y ., Lin, H., Guo, Z., Qiu, P.,
Zhou, A., Lu, P., Chang, K.-W., Gao, P., et al. Mathverse:
Does your multi-modal llm truly see the diagrams in
visual math problems? arXiv preprint arXiv:2403.14624 ,
2024d.
Zhang, R., Wei, X., Jiang, D., Zhang, Y ., Guo, Z., Tong,
C., Liu, J., Zhou, A., Zhang, S., Gao, P., et al. Mavis:
Mathematical visual instruction tuning. arXiv , 2024e.
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M.,
Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V .,
et al. Opt: Open pre-trained transformer language models.
arXiv preprint arXiv:2205.01068 , 2022b.
Zhang, Y ., Zhang, R., Gu, J., Zhou, Y ., Lipka, N., Yang, D.,
and Sun, T. Llavar: Enhanced visual instruction tuning for
text-rich image understanding. ArXiv , abs/2306.17107,
2023c. URL https://api.semanticscholar.
org/CorpusID:259287523 .
Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M.
Minigpt-4: Enhancing vision-language understanding
with advanced large language models. arXiv preprint
arXiv:2304.10592 , 2023a.
Zhu, X., Zhang, R., He, B., Guo, Z., Zeng, Z., Qin, Z.,
Zhang, S., and Gao, P. Pointclip v2: Prompting clip and
gpt for powerful 3d open-world learning. ICCV 2023 ,
2023b.
16
Page 17:
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
A. Appendix
A.1.Analysis of Routing Mechanisms in SPHINX-MoE
A.1.1. I NFERENCE WITH DIFFERENT NUMBERS OF
ACTIVATING EXPERTS
For SPHINX-MoE, the LLM backbone is based on Mixtral-
8×7B (Jiang et al., 2024b), which is a mixture-of-experts-
based large language model. Thus, during the inference
time, only some of the experts will be activated when dealing
with each token. In the training stage of SPHINX-MoE,
only two of the eight experts will be activated, so we set the
default number of activating experts to 2 when inference.
To investigate how the activating experts’ amount will affect
the inference performance, we change it from one to eight,
and the results are shown in Figure 5.
As we can see, on most datasets, i.e., ScienceQA (Lu et al.,
2022), TextQA (Singh et al., 2019), RefCOCO (Lin et al.,
2014) and Mathvista (Lu et al., 2023), when activating two
experts, keeping it the same with training setting, SPHINX-
MoE performs the best. However, for MME (Fu et al.,
2023b), when setting the number of activating experts to
four, SPHINX-MoE works the best. Two activating experts
actually make the second low-performance. This inconsis-
tency with the training setting is interesting.
A.1.2. E XPERTS ’USAGE DISTRIBUTION ON DIFFERENT
DOMAINS AND DIFFERENT MODALITIES
In some previous works, each expert in the mixture-of-
experts model is a specialist for a specific domain or modal-
ity,e.g., VLMO (Bao et al., 2022). So we explore that, in
SPHINX-MoE, how each expert in each layer deals with
data from different domains and different modalities. So
we pick the artwork, celebrity and OCR subtasks from the
MME (Fu et al., 2023b) benchmark, and infer SPHINX-
MoE on these subtasks with two activating experts, record-
ing the expert’s usage distribution of each layer, as shown
in Figure 4. Subfigure (a), (b) and (c) show the results on
vision modality, language modality and vision&language
modalities separately. From the distribution record, we
don’t see an obvious pattern that experts are specialists for
different domains or modalities. (i) For different domains,
the experts’ usage is similar for the three different domain
data: artwork, celebrity and OCR. (ii) For different modali-
ties, there are no specific experts that mainly deal with one
specific modality. But there is an interesting scenario that
the experts’ usage distribution of the layers at both ends of
the model is more flat than that of the the middle layers.
A.1.3. PRUNE SOME OF THE EXPERTS WHEN INFERENCE
Different from the dense model, for the sparse model only
part of the parameters are activated during the inference time.
So if we prune some experts, the ability of the model couldbe partly saved. To investigate how the number of pruned ex-
perts will affect SPHINX-MoE’s ability, we prune different
numbers of experts of SPHINX-MoE’s each layer, and the
results are shown in Figure 6. In Figure 6, the x-axis means
the number of retained experts after pruning. For each value
on the x-axis, termed as “ n”, we randomly choose 8−n
experts in each layer to be pruned in SPHINX-MoE. For
eachn, we run it three times for average performance.
We find that some experts are “important”, i.e., there is a
huge performance variance in the three runs. If we prune
some specific experts, SPHINX-MoE will lose most ability,
while in another run, the most ability of the model will be
retained even if we prune the same number of experts in
each layer. Thus, if we keep these “important” experts and
prune other less “important” experts in SPHINX-MoE, most
ability could be saved, as the upper tendency in Figure 6.
A.2. Video Analysis on MVBench
To further evaluate the video understanding capacity, we
evaluate SPHINX-X on MVBench (Li et al., 2023e), which
breaks down video understanding into 20 sub-aspects, allow-
ing for a more detailed comparison of model performance at
a finer granularity. As shown in Table 7, SPHINX-Plus, de-
spite being an image-based model, significantly outperforms
existing models (Jin et al., 2023; Su et al., 2023; Zhang
et al., 2023a) specifically tailored for video tasks. Especially
in the aspects of video-exclusive understanding and prior
knowledge-based question-answering, SPHINX-Plus show-
cases outstanding performance, signifying its proficiency in
visual perception and knowledge extraction capabilities. In
challenging datasets such as MOT, SPHINX-Plus demon-
strates slightly lower performance compared to existing
state-of-the-art methods (Maaz et al., 2023; Li et al., 2023d).
We attribute this to the need to model timing relationships in
videos. SPHINX-Plus has not been fine-tuned by any video
data, so its performance marginally underperforms others.
A.3. Additional details on the training dataset
Language Instruction-following Data. Unlike previous
works (Zhu et al., 2023a; Liu et al., 2023b;a) that uti-
lize instruction-tuned LLMs such as Vicuna (Chiang et al.,
2023), SPHINX-X is directly trained on top of the basic
pre-trained LLM, i.e., LLaMA2 (Touvron et al., 2023b).
This is to investigate the training characteristics of multi-
modal models from LLMs more clearly. Therefore, we are
required to collect a high-quality dataset combination for
language instruction-following. The dataset includes multi-
turn dialog, question-answering, code generation, and math
word problems. In detail, UltraChat (Ding et al., 2023) and
OpenOrca (Lian et al., 2023) are utilized for basic multi-
turn conversation abilities. MetaMath (Yu et al., 2023a) and
MathInstruct (Yue et al., 2023b) are high-quality mathemat-
17
Page 18:
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
(a) Vision Modality
(c) Vision&Language Modality(b) Language Modality
Figure 4. Experts’ usage distribution on different domains and different modalities.
Table 8. Comparison with state-of-the-art methods on MVBench. ‘V’ denotes a video training version of the model used.
Methods Avg. AS AP AA FA UA OE OI OS MD AL ST AC MC MA SC FP CO EN ER CI
Video-based MLLM
OtterV26.8 23.0 23.0 27.5 27.0 29.5 53.0 28.0 33.0 24.5 23.5 27.5 26.0 28.5 38.5 22.0 18.0 22.0 23.5 19.0 19.5
mPLUG-OwlV29.7 22.0 28.0 34.0 29.0 29.0 40.5 27.0 31.5 27.0 23.0 29.0 31.5 27.0 44.0 24.0 40.0 31.0 26.0 20.5 29.5
Video-ChatGPT 32.7 23.5 26.0 62.0 22.5 26.5 54.0 28.0 40.0 23.0 20.0 31.0 30.5 25.5 48.5 29.0 39.5 33.0 29.5 26.0 35.5
Video-LLaMA 34.1 27.5 25.5 51.0 29.0 39.0 48.0 40.5 38.0 22.5 22.5 43.0 34.0 22.5 45.5 32.5 32.5 40.0 30.0 21.0 37.0
VideoChat 35.5 33.5 26.5 56.0 33.5 40.5 53.0 40.5 30.0 25.5 27.0 48.5 35.0 20.5 46.0 26.5 42.5 41.0 23.5 23.5 36.0
VideoChat2 51.1 66.0 47.5 83.5 49.5 60.0 58.0 71.5 42.5 23.0 23.0 88.5 39.0 42.0 44.0 49.0 58.5 36.5 35.0 40.5 65.5
Image-based MLLM
SPHINX 37.5 32.5 31.5 65.0 38.5 43.5 54.0 37.5 28.5 22.5 26.5 45.5 39.0 41.0 47.5 40.0 23.5 37.5 31.0 35.0 30.5
SPHINX-Plus 39.7 47.5 32.0 58.0 42.5 43.5 45.0 44.0 35.5 29.0 27.0 52.0 38.0 41.0 59.5 37.5 23.0 41.0 29.0 40.0 29.5
MME Score
Figure 5. Performance with different numbers of activating
experts when inference. We respectively report the performance
on MME and other benchmarks.
ical datasets with reasoning process. WizardCoder (Luo
et al., 2023) is adopted for increasing the coding ability of
LLMs. Flan-mini (Ghosal et al., 2023) is a subset of FLANdatasets and is included for question-answering capabilities.
Visual Instruction-following Data. For comprehensive
visual understanding, we expand the data scale of SPHINX
to incorporate a variety of vision tasks and transform their
annotations into a unified question-answering format. The
tasks include image classification (Russakovsky et al., 2014),
object detection such as COCO (Lin et al., 2014),Open-
Images (Kuznetsova et al., 2018),Object365 (Shao et al.,
2019),Lvis (Gupta et al., 2019), human pose estimation such
as UniPose (Yang et al., 2023b), COCO-Pose (Lin et al.,
2014), and visual grounding. We utilize a task-specific
prompt as the question, and regard the ground-truth labels
as the answer by textualizing them in language space. For
generality, we do not utilize any special tokens for different
tasks, and treat them all as pure language problems. This
18
Page 19:
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
MME Score
Figure 6. Performance change when pruning different numbers
of experts in each layer. The black dotted line is the average of
the three runs of random pruning.
visual supervised fine-tuning enhances SPHINX-X with the
performance of image parsing, object localization, and rela-
tion reasoning, empowering MLLMs with in-built capacity
to be a universal visual generalist.
Vision-language Instruction-following Data. To align
MoV with LLMs and enable visual instruction follow-
ing, we gather large-scale datasets from established vi-
sual question-answering sources such as VQA V2 (Agrawal
et al., 2015), GQA (Hudson & Manning, 2019), OK-
VQA (Marino et al., 2019), Visual Genome (Krishna et al.,
2017), and CLEVR (Johnson et al., 2016). To specifi-
cally boost SPHINX-X’s text-oriented VQA capabilities,
we incorporate datasets including TextVQA (Singh et al.,
2019), DocVQA (Mathew et al., 2021b), ChartQA (Masry
et al., 2022), AI2D (Kembhavi et al., 2016), Deep-
form (Svetlichnaya, 2020), DVQA (Kafle et al., 2018),
InfographicsVQ (Mathew et al., 2021a), KleisterChar-
ity (Stanislawek et al., 2021), TabFact (Chen et al., 2019),
VisualMRC (Tanaka et al., 2021), and WikiTableQues-
tions (Pasupat & Liang, 2015). Leveraging the rich knowl-
edge embedded in large foundation models, we also en-
compass high-quality MLLM-generated data, e.g., dense
captioning data of ShareGPT4V (Chen et al., 2023b) and vi-
sual instruction data from LLaV A (Liu et al., 2023b), LVIS-
INSTRUCT4V (Wang et al., 2023a), and LLaV AR (Zhang
et al., 2023c). Additionally, we employ Geometry3K (Lu
et al., 2021a) to enhance the model’s geometry problem-
solving abilities.
19
Page 20:
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Table 9. One-stage training data summary of SPHINX-X.
Tasks #Samples Datasets
Language Instruction-following Data
Multi-turn Dialog 1.8M UltraChat,Flan-mini,OpenOrca
Math 0.6M MetaMathQA,MathInstruct
Coding 80k WizardCoder
Visual Instruction-following Data
Detection 4.9M V3Det,OpenImages,Lvis,COCO,Object365
Human Pose 0.3M Unipose,COCO-Pose
Classification 1M ImageNet1K
Grounding 1M Visual Genome, RefCOCO, RefCOCO+,RefCOCOg,Flickr30k
Vision-language Instruction-following Data
VQA 0.7MVQA V2,OKVQA,GQA,Visual Genome
CLEVR,ChartQA,DeepForm,DocVQA
DVQA, InfographicsVQA,KleisterCharity
VisualMRC,WikiTableQuestions
TextVQA,TabFact
Caption 0.5M MSCOCO,ShareGPT4V ,LaionGPV4V
Visual Instruction 0.4M LLaV A,LVIS-INSTRUCT4V ,LLaV AR
OCR-intensive Data
OCR 3M PaperText: Arxiv, Common Crawl
Text Layout & Spotting 1.0 M DocBank, M6Doc, Publaynet, DocLayNet, ICDAR, CTW1500
Set-of-Marks Instruction-following Data
Natural Images 5k COCO, LVIS, Visual Genome
Website/Mobile/Desktop agent 1k SeeClick
OCR-related 2k TotalText,CTW1500,IC13,IC15
Document Images 1k M6Doc, DoclayNet, PublayNet
Multipanel Images 1k In-house dataset
20
Page 21:
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
SPHINX -X:
Mark 1: The object is a cat, specifically a calico, characterized by its
tricolor fur pattern of white, black, and orange. The cat is wearing a
collar, which is a common accessory for domestic cats to help identify
them and keep them safe. The collar appears to be of a standard
design, likely made of nylon or a similar material, and is fitted around
the cat's neck. The collar is not fully visible, but it is clear that it is
snugly fitted and is a practical accessory for pet owners.
Mark 2: This is a beer bottle, likely made of glass, with a label that is
partially visible. The label contains text and possibly branding or
product information. The bottle is standing upright on a ledge,
suggesting it has been placed there intentionally. The bottle appears
to be full or nearly full, as the liquid level is near the top. The
bottle's cap is not visible, indicating it may have been removed or is
not currently in place.User:
Provide a detailed description for each of the marks I've
offered: <Mark 1> [0.56, 0.24], <Mark 2> [0.76, 0.50]Set-of-Mark Prompting Task
SPHINX -X:
Mark 1: Cat wearing a collar
Mark 2: Beer bottle on a ledgeUser:
Provide a short description for each of the marks I've
offered: <Mark 1> [0.56, 0.24], <Mark 2> [0.76, 0.50]
User:
Analyze the relationships among the marks I've offered:
<Mark 1> [0.56, 0.24], <Mark 2> [0.76, 0.50]SPHINX -X:
<Mark 1> <Mark 2> : The cat and the bottle are both
situated outdoors, with the cat appearing to be on a
ledge and the bottle on the ground. The cat is positioned
in the foreground, while the bottle is in the background.
There is no direct interaction between the two, but they
share the same environment.
SPHINX -X:
Sure, please ask your questions.User:
Answer the following questions based on the marks I’ve
provided: <Mark 1> [0.56, 0.24], <Mark 2> [0.76, 0.50]
Short CaptioningDetailed Captioning
Relation AnalysisVisual Question Answering
SPHINX -X:
The cat marked <Mark 1> is a calico, which is a breed
with a tricolor coat, typically white, black, and orange.User:
Can you tell me what breed the cat marked <Mark 1>
is? What is his coat color like?
SPHINX -X:
There are “Virtual[0.25,0.30,0.52,0.38], Office[0.53,0.31,0.72,0.38],
of[0.23,0.39,0.30,0.46], the[0.34,0.39,0.47,0.46], ”Future“[0.49,0.39,0.72,0.45],
1st[0.25,0.68,0.30,0.72], floor[0.32,0.68,0.39,0.72],
Deutsches [0.61,0.48,0.74,0.52], Forschungszentrum [0.61,0.52,0.79,0.56],
for[0.61,0.56,0.64,0.59], Kunstliche [0.65,0.56,0.77,0.60],
Intelligenz [0.61,0.59,0.76,0.63], GmbH[0.74,0.59,0.78,0.62],
1.21[0.76,0.67,0.80,0.71], and Competence[0.24,0.20,0.50,0.25] in the image.User:
Detect all regions in the
image that contain text and
identify their content.OCR-intensive Task <Mark 1>
<Mark 2>
Text Spotting
SPHINX -X:
There are seven
text[0.16,0.50,0.48,0.64;0.16,0.64,0.48,0.74;0.16,0.73,0.48,0.85;0.16,0.85,
0.48,0.98;0.50,0.43,0.83,0.59;0.50,0.59,0.83,0.67;0.50,0.72,0.83,0.98],
figure[0.51,0.06,0.82,0.31], list[0.16,0.06,0.48,0.49], and
title[0.50,0.69,0.71,0.71] in the image .User:
Detect the layout of the document in the image.Docume ntLayout Detection
Figure 7. Set-of-Marks (SoM) prompting and OCR-intensive capabilities of SPHINX-X. With our constructed two datasets, SPHINX-
X exhibits outstanding visual performance on SoM prompting and OCR-related tasks. Note that the SoM marks are only utilized in the
textual prompt, without rendering on input images.
21