Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows

XLANG Lab, The University of Hong Kong


Abstract

Language models have demonstrated remarkable performance in code generation, including text-to-SQL tasks. However, real-world enterprise-level text-to-SQL workflows often involve complex cloud or local data across various database systems, multiple SQL queries in various dialects, and diverse operations from data transformation to analytics. These workflows typically process natural language analytic questions, yet we lack a comprehensive and challenging testbed that encapsulates these phenomena, which is essential for advancing the capabilities of these models and evaluating their true potential in code generation, specifically in text-to-SQL tasks. To this end, we introduce **Spider 2.0**, an evaluation framework comprising 600 real-world text-to-SQL workflow problems derived from enterprise-level database use cases. The databases in **Spider 2.0** are sourced from real data applications, often containing over 1,000 columns and stored in cloud or local database systems such as BigQuery, Snowflake, or PostgreSQL. Solving problems in **Spider 2.0** frequently requires understanding and searching through database metadata, dialect documentation, and even project-level codebases. This challenge calls for models to interact with complex SQL workflow environments, process extremely long contexts, perform intricate reasoning, and generate multiple SQL queries with diverse operations, often exceeding 100 lines, going far beyond traditional text-to-SQL challenges. Our evaluations indicate that current state-of-the-art LLMs, such as GPT-4, and our proposed LLM-based Spider-Agent can solve a mere 6.0% of the questions, compared to 86.6% on Spider 1.0 and 57.4% on BIRD. These results underscore the significant challenges posed by **Spider 2.0**. Progress on **Spider 2.0** represents a crucial step towards developing LLM-based code agents that are more intelligent and autonomous in real-world enterprise settings.

News

  • Aug. 28, 2024: We released a smaller version of Spider 2.0 (~25% of the full dataset) containing 190 examples to give users early access. The full dataset and the paper will be available in two weeks. As this is a preliminary release, it may contain errors. Your feedback would be invaluable in refining the dataset. Stay tuned!

Why Spider 2.0?

In 2018, we introduced Spider 1.0, SParC, and CoSQL as part of the Yale Semantic Parsing and Text-to-SQL Challenge Series, attracting over 300 submissions from leading research labs worldwide.

Now, in the era of Large Language Models (LLMs), we present Spider 2.0 to advance code generation, particularly text-to-SQL capabilities.

This new benchmark offers a more realistic and challenging test of LLMs' performance on complex enterprise-level text-to-SQL workflows, involving complex data environments (e.g., >3000 columns), multiple SQL dialects (e.g., BigQuery, Snowflake), and diverse operations (e.g., transformation, analytics).

Notably, as shown below, even the most advanced LLMs, including GPT-4, solve only 6.0% of Spider 2.0 tasks, compared to 86.6% on Spider 1.0 and 57.4% on BIRD, highlighting the significant challenges posed by Spider 2.0.

| Method | Spider 1.0 dev | Spider 1.0 test | BIRD test | Spider 2.0 |
|---|---|---|---|---|
| DailSQL + GPT-4 | 82.4 | 86.6 | 57.4 | 6.0 |
| CodeS-7B | 85.4 | - | 59.3 | 1.3 |

Spider 2.0-Lite

To serve research interest in traditional Text2SQL settings, we also release Spider 2.0-Lite, a more self-contained subset of Spider 2.0 that supports faster development and evaluation.

Data Examples


Have Questions?

Ask us questions on our GitHub issues page, or contact Fangyu Lei, Jixuan Chen, Ruisheng Cao, or Yuxiao Ye for more information.

Leaderboard

We report scores from the Spider 2.0 evaluation suite.
The agent has to interact with complex SQL workflows, process extremely long contexts, perform intricate reasoning, and generate multiple SQL queries with diverse operations, often exceeding 100 lines, across multiple turns.
| Rank | Date | Method | Score |
|---|---|---|---|
| 1 | Aug 24, 2024 | Spider-Agent + GPT-4o | 7.36 |