Amber Tanaka committed
Commit 95d1ab7 · unverified · 1 parent: c917078

Update copy (#39)

Files changed (3):
  1. about.py +24 -24
  2. category_page_builder.py +1 -1
  3. content.py +15 -41
about.py CHANGED
@@ -5,52 +5,52 @@ def build_page():
 gr.Markdown(
 """
 ## About AstaBench
-AstaBench is a best-in-class AI agents evaluation framework to measure scientific research abilities. AstaBench provides a challenging new test for AI agents: the first benchmark challenge that evaluates agents’ scientific abilities on a broad spectrum of research skills, including literature understanding, data analysis, planning, tool use, coding, and search.
+AstaBench is a novel AI agents evaluation framework, providing a challenging new test for AI agents: the first benchmark challenge that evaluates agents’ scientific abilities on a broad spectrum of research skills, including literature understanding, data analysis, planning, tool use, coding, and search. Asta’s set of standard tools makes it easy to build general-purpose science agents and to compare their performance in an apples-to-apples manner.


-**Why AstaBench?**
-Newer benchmarks may test agentic AI and isolated aspects of scientific reasoning, but none rigorously measure agentic AI or capture the full range of skills research demands. Agents can appear effective by simply retrying tasks—often at high computational cost and with inconsistent results. Scientific AI needs evaluations that reflect the real complexity of research.
+## Why AstaBench?
+Most current benchmarks test agentic AI and isolated aspects of scientific reasoning, but rarely evaluate AI agentic behavior rigorously or capture the full skill set scientific research requires. Agents can appear effective despite inconsistent results and high compute use, often outperforming others by consuming more resources. Advancing scientific AI requires evaluations that emphasize reproducibility, efficiency, and the real complexity of research.

-AstaBench fills that gap: a suite of open benchmarks for evaluating scientific AI assistants on core scientific tasks that require novel reasoning. AstaBench helps scientists identify which agents best support their needs through task-relevant leaderboards, while giving AI developers a standard execution environment and standard tools to test the scientific reasoning capabilities of their agents compared to well-known baselines from the literature, including both open and closed LLM foundation models.
+AstaBench fills this gap: an agents evaluation framework and suite of open benchmarks for evaluating scientific AI assistants on core scientific tasks that require novel reasoning. AstaBench helps scientists identify which agents best support their needs through task-relevant leaderboards, while giving AI developers a standard execution environment and tools to test the scientific reasoning capabilities of their agents compared to well-known baselines from the literature, including both open and closed LLM foundation models.


-**What Does AstaBench Include?**
-The suite includes over 8,000 tasks across 11 benchmarks, organized into four core categories:
-- Literature Understanding
-- Code & Execution
-- Data Analysis
-- End-to-End Discovery
+## What Does AstaBench Include?
+AstaBench includes a rigorous agents evaluation framework and a suite of benchmarks consisting of over 2,400 problems across 11 benchmarks, organized into four core categories:
+Literature Understanding
+Code & Execution
+Data Analysis
+End-to-End Discovery
+Plus: a large suite of integrated agents and leaderboards with results from extensive evaluation of agents and models.

 🔍 Learn more in the AstaBench technical blog post


-**Understanding the Leaderboards**
-The AstaBench Main Leaderboard provides a high-level view of overall agent performance and efficiency:
+## Understanding the Leaderboards
+The AstaBench Overall Leaderboard provides a high-level view of overall agent performance and efficiency:
 - Overall score: A macro-average of the four category-level averages (equal weighting)
 - Overall cost: Average cost per task, aggregated only across benchmarks with reported cost

 Each category leaderboard provides:
-- Average score and cost for that category
+- Average score and cost for that category (macro-averaged across the benchmarks in the category)
 - A breakdown by individual benchmarks


-**Scoring & Aggregation**
-AstaBench encourages broad, honest evaluation. Here's how we handle scoring, cost, and partial results:
+## Scoring & Aggregation
+AstaBench encourages careful, transparent evaluation. Here's how we handle scoring, cost, and partial results:

-_Scores_
-- Each benchmark returns an average score based on per-task accuracy
-- Skipped benchmarks receive a score of 0.00
+**Scores**
+- Each benchmark returns an average score based on per-problem scores
 - All scores are aggregated upward using macro-averaging
 - Partial completions are included (even with poor performance)

-_Cost_
-- Costs are reported in USD per task, based on values at the time of submission
+**Cost**
+- Costs are reported in USD per task.
 - Benchmarks without cost data are excluded from cost averages
-- In scatter plots, agents without cost are plotted far right and clearly marked
-Note: Cost values reflect pricing and infrastructure conditions at the time of each submission. We recognize that compute costs may change over time, and are actively working on methods to normalize cost data across submissions for fairer longitudinal comparisons.
+- In scatter plots, agents without cost are plotted to the far right and clearly marked.

+Note: Cost values reflect pricing and infrastructure conditions at the time of each submission. We recognize that compute costs may change over time, and are actively working on methods to normalize cost data across submissions for fairer longitudinal comparisons.

-_Coverage_
+**Coverage**
 - Main leaderboard: category coverage (X/4)
 - Category view: benchmark coverage (X/Y)
 - Incomplete coverage is flagged visually
@@ -58,7 +58,7 @@ _Coverage_
 These design choices ensure fair comparison while penalizing cherry-picking and omissions.


-**Learn More**
+## Learn More
 - AstaBench technical blog post
 - FAQ and submission guide
 """, elem_id="about-content"
category_page_builder.py CHANGED
@@ -13,7 +13,7 @@ def build_category_page(CATEGORY_NAME, PAGE_DESCRIPTION):

 with gr.Column(elem_id="test_nav_container", visible=True) as test_nav_container:
 create_sub_navigation_bar(test_tag_map, CATEGORY_NAME)
-gr.Markdown(f"## Astabench{CATEGORY_NAME} Leaderboard")
+gr.Markdown(f"## Astabench{CATEGORY_NAME} Leaderboard (Aggregate)")
 gr.Markdown(PAGE_DESCRIPTION, elem_id="category-intro")
 # --- This page now has two main sections: Validation and Test ---
 with gr.Tabs():
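For orientation, build_category_page takes a category name and one of the description strings defined in content.py (shown in the next file). A minimal usage sketch; the gr.Blocks wrapper, the import paths, and the launch call are assumptions about the surrounding app, not taken from this repository.

```python
import gradio as gr

from category_page_builder import build_category_page  # assumed import path
from content import LIT_DESCRIPTION                     # assumed import path

# Illustrative wrapper; the real app presumably builds all four category pages
# plus the main leaderboard and navigation.
with gr.Blocks() as demo:
    # Renders the "## Astabench{CATEGORY_NAME} Leaderboard (Aggregate)" heading
    # (note the f-string has no space after "Astabench"), the category intro,
    # and the Validation/Test tabs.
    build_category_page("Literature Understanding", LIT_DESCRIPTION)

if __name__ == "__main__":
    demo.launch()
```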
content.py CHANGED
@@ -38,57 +38,31 @@ Agents names that are green are Pareto optimal, meaning they achieve the best pe
 """
 LIT_DESCRIPTION = """
 The **Literature Understanding** category evaluates how well agents comprehend and interact with scientific literature—testing their ability to find research papers, assess citation quality, extract information from text, and more.
+<br><br>
+The scores shown below reflect performance aggregated across five distinct benchmarks, each targeting a different aspect of literature-based reasoning.
+<br><br>
+For detailed results, use the links above to explore individual benchmarks.
 <br>
-The scores shown below reflect performance aggregated across five distinct benchmarks, each targeting a different aspect of literature-based reasoning:
-<br>
-- PaperFinding Bench – PLACEHOLDER DESCRIPTION
-<br>
-- ScholarQA Bench2 – PLACEHOLDER DESCRIPTION
-<br>
-- LitQA2-FT – PLACEHOLDER DESCRIPTION
-<br>
-- ArxivDIGESTables-Clean – PLACEHOLDER DESCRIPTION
-<br>
-<br>
-Together, these tasks form a comprehensive evaluation of an agent’s ability to navigate, understand, and reason over scientific publications
 """
 CODE_EXECUTION_DESCRIPTION = """
-The **Code & Execution** category in AstaBench includes tasks that evaluate an agent’s ability to write, modify, and run code in realistic research scenarios. Unlike literature tasks—which can sometimes be solved by a language model alone—these problems often require the agent to interact with tools: reading input files, executing code, and writing outputs to specific files in the required format.
-<br>
-<br>
-The scores in this category are aggregated from three distinct benchmarks, each targeting different facets of scientific coding and execution:
-<br>
-- CORE-Bench-Hard – PLACEHOLDER DESCRIPTION
-<br>
-- DS-1000 – PLACEHOLDER DESCRIPTION
-<br>
-- SUPER-Expert – PLACEHOLDER DESCRIPTION
+The **Code & Execution** category in AstaBench includes tasks that evaluate an agent’s ability to write, modify, and run code in realistic research scenarios. Unlike literature tasks—which only require read-only tools and can sometimes even be solved by a language model alone—these problems often require the agent to manipulate a machine environment with tools: reading input files, executing code, and writing outputs to specific files in the required format.
+<br><br>
+The scores in this category are aggregated from three distinct benchmarks, each targeting different facets of scientific coding and execution. Together, these benchmarks evaluate whether an agent can function as a hands-on scientific assistant—not just by reasoning about code, but by running it in real-world contexts.
+<br><br>
+For detailed results, use the links above to explore individual benchmark pages.
 <br>
-<br>
-Together, these benchmarks evaluate whether an agent can function as a hands-on scientific assistant—not just by reasoning about code, but by running it in real-world contexts.
 """
 DATA_ANALYSIS_DESCRIPTION = """
-The **Data Analysis** category evaluates agents on their ability to analyze structured datasets and generate meaningful scientific hypotheses. It currently includes a single benchmark:
-<br>
-- DiscoveryBench
-<br>
-so the category-level scores are the same as the benchmark-level results.
-<br>
-<br>
+The **Data Analysis** category evaluates agents on their ability to analyze structured datasets and generate meaningful scientific hypotheses. It currently includes a single benchmark, DiscoveryBench, so the category-level scores are the same as the benchmark-level results.
+<br><br>
 As additional benchmarks are added in the future, this category will expand to cover a broader range of data-driven reasoning tasks across scientific domains.
+<br>
 """
 DISCOVERY_DESCRIPTION = """
-The **End-to-End Discovery** category tests whether agents can carry out a complete scientific workflow—from hypothesis generation and experiment design to code execution, analysis, and report writing. These tasks require agents to integrate multiple capabilities, producing not just answers but full research artifacts.
-<br>
-<br>
-Scores in this category are aggregated from two benchmarks:
-<br>
-- E2E-Bench – PLACEHOLDER DESCRIPTION
-<br>
-- E2E-Bench-Hard – PLACEHOLDER DESCRIPTION
-<br>
+The **End-to-End Discovery** category tests whether agents can carry out a complete scientific workflow, from task description to experiment design, code execution, results analysis, and report writing. These tasks require agents to integrate multiple capabilities, producing not just answers but full research artifacts.
+<br><br>
+Scores in this category are aggregated from two benchmarks, providing the first standardized way to evaluate automated scientific discovery (ASD) agents across all stages of the research process. Use the links above to explore individual benchmark pages.
 <br>
-This category provides the first standardized way to evaluate automated scientific discovery (ASD) agents across all stages of the research process.
 """

 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"