Spider 2.0

About Spider 2.0

Spider 2.0 is an evaluation framework comprising 632 real-world text-to-SQL workflow problems derived from enterprise-level database use cases. The databases in Spider 2.0 are sourced from real data applications, often contain over 1,000 columns, and are stored in local or cloud database systems such as BigQuery and Snowflake. Solving these problems requires models to interact with complex SQL workflow environments, process extremely long contexts, perform intricate reasoning, and generate multiple SQL queries with diverse operations, often exceeding 100 lines. These demands go far beyond those of traditional text-to-SQL challenges.

News

2025-05-22: We have created a new task setting, Spider2-DBT, and removed the original Spider2 setting. Spider2-DBT consists of only 68 tasks, enabling quick and smooth benchmarking with spider-agent-dbt. It is a comprehensive, repository-level text-to-SQL task.
2025-04-20: We provide the ground-truth tables for spider2-lite and spider2-snow to help quick benchmarking and analysis. However, when using this setting, you must indicate that you are using oracle tables.
2025-01-10: Please refer to the data update log to track changes in the evaluation examples. The leaderboard results will also change dynamically accordingly.
2025-01-07: Please note that we do not recommend using the Spider 2.0 Gold SQL we released for SFT, as it may affect the fairness of evaluation and hinder better benchmarking of the model's SQL capabilities. The release of Gold SQL is intended to help users design prompts.
2024-12-26: Use Spider-Agent to benchmark your LLMs! Given the widespread interest in the traditional text-to-SQL setting, we now recommend using spider-agent-lite and spider-agent-snow with spider2-lite and spider2-snow, respectively, to benchmark your LLMs. The final output should be CSV files, not SQL queries.
2024-12-24: Considering the many evaluation requirements, we have decided to release all examples and gold answers for self-evaluation. However, only a small amount of gold SQL is available. The leaderboard is still active. To have your method officially validated and upload your scores to the leaderboard, please follow the submission guidance.
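
Because the recommended setting expects CSV files rather than SQL text as the final output, a minimal sketch of that last step may be useful. It uses only Python's standard `sqlite3` and `csv` modules against an in-memory database. The table, the query, and the helper name `query_to_csv` are hypothetical, and the file naming and layout expected by the official evaluation are not specified here, so treat this as an illustration rather than the submission format.

```python
import csv
import io
import sqlite3


def query_to_csv(conn, sql):
    """Run `sql` on `conn` and return the result set as CSV text with a header row."""
    cursor = conn.execute(sql)
    header = [col[0] for col in cursor.description]  # column names of the result
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(cursor.fetchall())
    return buffer.getvalue()


# Hypothetical toy data; real Spider 2.0 databases are far larger.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 10.0), ("west", 5.0), ("east", 2.5)],
)

csv_text = query_to_csv(
    conn,
    "SELECT region, SUM(amount) AS total FROM orders GROUP BY region ORDER BY region",
)
print(csv_text)
```

Writing `csv_text` to a file is then a single `open(path, "w").write(csv_text)` call; the same helper should work with any DB-API connection whose `execute` returns a cursor, which is an assumption to check for cloud drivers.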

Milestone

As of now, all methods combined can solve 71.66% (392/547) of the examples in Spider 2.0!
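
The percentage follows directly from the two counts in the sentence above. As a quick check in Python (the variable names are illustrative):

```python
solved = 392  # examples solved by all methods combined, from the milestone above
total = 547   # examples in the setting

rate = round(100 * solved / total, 2)
print(f"{rate}%")  # prints 71.66%
```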

Why Spider 2.0?

In 2018, we introduced Spider 1.0, SParC, and CoSQL as part of the Yale Semantic Parsing and Text-to-SQL Challenge Series, attracting over 300 submissions from leading research labs worldwide.

Now, in the era of Large Language Models (LLMs), we present Spider 2.0 to advance code generation, particularly text-to-SQL capabilities.

This new benchmark offers a more realistic and challenging test of LLMs' performance on complex enterprise-level text-to-SQL workflows, involving complex data environments (e.g., >3000 columns), multiple SQL dialects (e.g., BigQuery, Snowflake), and diverse operations (e.g., transformation, analytics).

Notably, even an advanced LLM such as o1-preview solves only 17.1% of Spider 2.0 tasks. For widely used models like GPT-4o, the success rate is only 10.1% on Spider 2.0 tasks, compared to 86.6% on Spider 1.0, underscoring the substantial challenges posed by Spider 2.0.

Setting	Task Type	#Examples	Databases	Cost
Spider 2.0-Snow	Text-to-SQL task	547	Snowflake(547)	NO COST!😊
Spider 2.0-Lite	Text-to-SQL task	547	BigQuery(214), Snowflake(198), SQLite(135)	Some cost incurred
Spider 2.0-DBT	Code agent task	68	DuckDB (DBT)(68)	NO COST!😊

Acknowledgement

We thank Snowflake for their generous support in hosting the Spider 2.0 Challenge. We also thank Minghang Deng, Tianbao Xie, Yiheng Xu, Fan Zhou, Yuting Lan, Per Jacobsson, Yiming Huang, Canwen Xu, Zhewei Yao, and Binyuan Hui for their helpful feedback on this work. The website and submission guidelines are greatly inspired by BIRD-SQL, and we thank them for their contributions.

Have Questions?

Ask us questions on our GitHub issues page, or contact Fangyu Lei, Jixuan Chen, Ruisheng Cao, or Yuxiao Ye for more information.

Citation

@article{lei2024spider,
  title={Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows},
  author={Lei, Fangyu and Chen, Jixuan and Ye, Yuxiao and Cao, Ruisheng and Shin, Dongchan and Su, Hongjin and Suo, Zhaoqing and Gao, Hongcheng and Hu, Wenjing and Yin, Pengcheng and others},
  journal={arXiv preprint arXiv:2411.07763},
  year={2024}
}

Leaderboard

Spider 2.0-Snow is a self-contained text-to-SQL task with well-prepared database metadata and documentation. It includes 547 examples, all hosted on Snowflake, which offers participants free quotas.
Methods with -* use special settings (ground-truth tables) and are not included in the ranking.

Rank	Method	Score
	WindAgent + Claude-4-Sonnet AI For FinData	45.34
	Meituan-agent Meituan FinData Intelligence	44.79
	Chat2DB-Agent + Claude-4-Sonnet Chat2DB	44.06
	ByteBrain-Agent (w GT Tables) ByteDance Infra System Lab	43.69
	Ask Data with Relational Knowledge Graph AT&T CDO & RelationalAI	38.39
	DB-surfer + Qwen3 Alibaba Cloud	38.21
	ReFoRCE + o3 Hao AI Lab x Snowflake [Deng et al. '25]	37.11
	ReFoRCE + o1-preview Hao AI Lab x Snowflake [Deng et al. '25]	31.26
	Spider-Agent + Claude-4-Sonnet-20250514	25.78
	Spider-Agent + Claude-3.7-Sonnet-20250219	24.50
	Spider-Agent + Claude-3.7-Sonnet-20250219-Thinking	24.31
	Spider-Agent + o1-preview	23.58
	Spider-Agent + o1-2024-12-17	23.21
	Spider-Agent + o3-mini-2025-01-31	19.20
	Spider-Agent + Claude-3.5-Sonnet-20241022 AWS ProServe	19.01
	Spider-Agent + Claude-3.5-Sonnet-20241022	15.54
	Spider-Agent + Gemini-2.0-Pro	13.89
	Spider-Agent + GPT-4o-2024-11-20	12.98
	Spider-Agent + DeepSeek-R1	10.55
	CollideNL2SQL + GPT-4o Collide Tech	9.68
	ACNL2SQL-o3 ALIBABA EI	9.14
	Spider-Agent + QwQ-32B	8.96
	Spider-Agent + DeepSeek-V3	8.78
	Spider-Agent + Qwen2.5-Coder-32B-Instruct	5.48
	DAIL-SQL + GPT-4o	2.20
	CHESS + GPT-4o	1.28
	DIN-SQL + GPT-4o	0.00
	SFT CodeS-15B	0.00

Rank	Method	Score
	ReFoRCE + o3 Hao AI Lab x Snowflake [Deng et al. '25]	37.84
	RSL-SQL + o3 HUST VLR Lab [Cao et al. '24]	33.09
	LinkAlign + DeepSeek-R1 [Wang et al. '25]	33.09
	RSL-SQL + DeepSeek-R1 HUST VLR Lab [Cao et al. '24]	30.53
	ReFoRCE + o1-preview Hao AI Lab x Snowflake [Deng et al. '25]	30.35
	Spider-Agent + Claude-3.7-Sonnet-20250219-Thinking	28.52
	Spider-Agent + Claude-4-Sonnet-20250514	27.79
	RSL-SQL + DeepSeek-V3 HUST VLR Lab [Cao et al. '24]	26.14
	Spider-Agent + Claude-3.7-Sonnet-20250219	25.41
	LinkAlign + DeepSeek-V3 [Wang et al. '25]	24.86
	Spider-Agent + o3-mini-2025-01-31	23.40
	Spider-Agent + o1-preview	23.03
	Spider-Agent + DeepSeek-R1	13.71
	Spider-Agent + GPT-4o-2024-11-20	13.16
	Spider-Agent + QwQ-32B	11.33
	Duo Anonymous	8.96
	Spider-Agent + Claude-3.5-Sonnet-20240620	8.32
	Spider-Agent + Qwen2.5-Coder-32B-Instruct	5.85
	DAIL-SQL + GPT-4o	5.68
	CHESS + GPT-4o	3.84
	DIN-SQL + GPT-4o	1.46
	SFT CodeS-15B	0.73

Rank	Method	Score
	DAQUV_QUVI_Agent DAQUV	17.65
	Spider-Agent + Claude-3.7-Sonnet-20250219	14.70
	Spider-Agent + o1-preview	13.24
	Spider-Agent + GPT-4o	7.35
	Spider-Agent + o3-mini	4.41
	Spider-Agent + o3	2.94