547 lines (547 loc) · 227 KB
spider2-lite.jsonl
{"instance_id": "bq011", "db": "ga4", "question": "How many pseudo users were active in the last 7 days but inactive in the last 2 days as of January 7, 2021?", "external_knowledge": "ga4_obfuscated_sample_ecommerce.events.md"}
{"instance_id": "bq010", "db": "ga360", "question": "Find the top-selling product among customers who bought 'Youtube Men’s Vintage Henley' in July 2017, excluding itself.", "external_knowledge": "google_analytics_sample.ga_sessions.md"}
{"instance_id": "bq009", "db": "ga360", "question": "Which traffic source has the highest total transaction revenue for the year 2017, and what is the difference in millions (rounded to two decimal places) between the highest and lowest monthly total transaction revenue for that traffic source?", "external_knowledge": "google_analytics_sample.ga_sessions.md"}
{"instance_id": "bq001", "db": "ga360", "question": "For each visitor who made at least one transaction in February 2017, how many days elapsed between the date of their first visit in February and the date of their first transaction in February, and on what type of device did they make that first transaction?", "external_knowledge": "google_analytics_sample.ga_sessions.md"}
{"instance_id": "bq002", "db": "ga360", "question": "During the first half of 2017, focusing on hits product revenue, which traffic source generated the highest total product revenue, and what were the maximum daily, weekly, and monthly product revenues (in millions) for that top-performing source over this period?", "external_knowledge": "google_analytics_sample.ga_sessions.md"}
{"instance_id": "bq003", "db": "ga360", "question": "Between April 1 and July 31 of 2017, using the hits product revenue data along with the totals transactions to classify sessions as purchase (transactions ≥ 1 and productRevenue not null) or non-purchase (transactions null and productRevenue null), compare the average pageviews per visitor for each group by month", "external_knowledge": "google_analytics_sample.ga_sessions.md"}
{"instance_id": "bq004", "db": "ga360", "question": "In July 2017, among all visitors who bought any YouTube-related product, which distinct product—excluding those containing ‘YouTube’ in the product name—had the highest total quantity purchased?", "external_knowledge": "google_analytics_sample.ga_sessions.md"}
{"instance_id": "bq008", "db": "ga360", "question": "In January 2017, among visitors whose campaign name contains 'Data Share' and who accessed any page starting with '/home', which page did they most commonly visit next, and what is the maximum time (in seconds) they spent on the '/home' page before moving on?", "external_knowledge": "google_analytics_sample.ga_sessions.md"}
{"instance_id": "bq269", "db": "ga360", "question": "Between June 1, 2017, and July 31, 2017, consider only sessions that have non-null pageviews. Classify each session as ‘purchase’ if it has at least one transaction, or ‘non_purchase’ otherwise. For each month, sum each visitor’s total pageviews under each classification, then compute the average pageviews per visitor for both purchase and non-purchase groups in each month, and present the results side by side.", "external_knowledge": null}
{"instance_id": "bq268", "db": "ga360", "question": "Identify the longest number of days between the first visit and the last recorded event (either the last visit or the first transaction) for a user, where the last recorded event is associated with a mobile device. The last recorded event could either be the last visit or the first transaction, and you should focus on users whose last recorded event occurred on a mobile device.", "external_knowledge": null}
{"instance_id": "bq270", "db": "ga360", "question": "What were the monthly add-to-cart and purchase conversion rates, calculated as a percentage of pageviews on product details, from January to March 2017?", "external_knowledge": "ga360_hits.eCommerceAction.action_type.md"}
{"instance_id": "bq275", "db": "ga360", "question": "Which visitor IDs belong to users whose first transaction occurred on a device explicitly labeled as 'mobile' on a later date than their first visit?", "external_knowledge": null}
{"instance_id": "bq374", "db": "ga360", "question": "Calculates the percentage of new users who, between August 1, 2016, and April 30, 2017, both stayed on the site for more than 5 minutes during their initial visit and made a purchase on a subsequent visit at any later time, relative to the total number of new users in the same period.", "external_knowledge": null}
{"instance_id": "sf_bq029", "db": "PATENTS", "question": "Get the number of patent publications and the average number of inventors per patent in CA every five years from 1960 to 2020, based on when the patents were filed. Focus only on patents with inventor details.", "external_knowledge": null}
{"instance_id": "sf_bq026", "db": "PATENTS", "question": "For the assignee who has been the most active in the patent category 'A61', I'd like to know the five patent jurisdictions code where they filed the most patents during their busiest year, separated by commas.", "external_knowledge": null}
{"instance_id": "sf_bq091", "db": "PATENTS", "question": "In which year did the assignee with the most applications in the patent category 'A61' file the most?", "external_knowledge": null}
{"instance_id": "sf_bq099", "db": "PATENTS", "question": "For patent class A01B3, I want to analyze the information of the top 3 assignees based on the total number of applications. Please provide the following five pieces of information: the name of this assignee, total number of applications, the year with the most applications, the number of applications in that year, and the country code with the most applications during that year.", "external_knowledge": null}
{"instance_id": "sf_bq033", "db": "PATENTS", "question": "How many U.S. publications related to IoT (where the abstract includes the phrase 'internet of things') were filed each month from 2008 to 2022, including months with no filings?", "external_knowledge": null}
{"instance_id": "sf_bq209", "db": "PATENTS", "question": "Can you find how many utility patents granted in 2010 have exactly one forward citation within the ten years following their application date?", "external_knowledge": null}
{"instance_id": "sf_bq027", "db": "PATENTS", "question": "For patents granted between 2010 and 2018, provide the publication number of each patent and the number of backward citations it has received in the SEA category.", "external_knowledge": null}
{"instance_id": "sf_bq210", "db": "PATENTS", "question": "How many US B2 patents granted between 2008 and 2018 contain claims that do not include the word 'claim'?", "external_knowledge": null}
{"instance_id": "sf_bq211", "db": "PATENTS", "question": "Among patents granted between 2010 and 2023 in CN, how many of them belong to families that have a total of over one distinct applications?", "external_knowledge": null}
{"instance_id": "sf_bq213", "db": "PATENTS", "question": "What is the most common 4-digit IPC code among US B2 utility patents granted from June to August in 2022?", "external_knowledge": "patents_info.md"}
{"instance_id": "sf_bq212", "db": "PATENTS", "question": "For United States utility patents under the B2 classification granted between June and September of 2022, identify the most frequent 4-digit IPC code for each patent. Then, list the publication numbers and IPC4 codes of patents where this code appears 10 or more times.", "external_knowledge": "patents_info.md"}
{"instance_id": "sf_bq214", "db": "PATENTS_GOOGLE", "question": "For United States utility patents under the B2 classification granted between 2010 and 2014, find the one with the most forward citations within a month of its filing date, and identify the most similar patent from the same filing year, regardless of its type.", "external_knowledge": "patents_info.md"}
{"instance_id": "sf_bq216", "db": "PATENTS_GOOGLE", "question": "Identify the top five patents filed in the same year as `US-9741766-B2` that are most similar to it based on technological similarities. Please provide the publication numbers.", "external_knowledge": "patents_info.md"}
{"instance_id": "sf_bq247", "db": "PATENTS_GOOGLE", "question": "From the publications dataset, first identify the top six families with the most publications whose family_id is not '-1'. Then, using the abs_and_emb table (joined on publication_number), provide each of those families’ IDs alongside every non-empty abstract associated with their publications.", "external_knowledge": null}
{"instance_id": "sf_bq127", "db": "PATENTS_GOOGLE", "question": "For each publication family whose earliest publication was first published in January 2015, please provide the earliest publication date, the distinct publication numbers, their country codes, the distinct CPC and IPC codes, distinct families (namely, the ids) that cite and are cited by this publication family. Please present all lists as comma-separated values, sorted alphabetically", "external_knowledge": null}
{"instance_id": "sf_bq215", "db": "PATENTS", "question": "Which US patent (with a B2 kind code and a grant date between 2015 and 2018) has the highest originality score calculated as 1 - (the sum of squared occurrences of distinct 4-digit IPC codes in its backward citations divided by the square of the total occurrences of these 4-digit IPC codes)?", "external_knowledge": "patents_info.md"}
{"instance_id": "sf_bq222", "db": "PATENTS", "question": "Find the CPC technology areas in Germany that had the highest exponential moving average (smoothing factor 0.1) of patent filings per year, specifically for patents granted in December 2016. For each CPC group at level 4, show the full title, CPC group, and the year with the highest exponential moving average of patent filings.", "external_knowledge": "sliding_windows_calculation_cpc.md"}
{"instance_id": "sf_bq221", "db": "PATENTS", "question": "Identify the CPC technology areas with the highest exponential moving average of patent filings each year (with a smoothing factor of 0.2), considering only the first CPC code for each patent that has a valid filing date and a non-empty application number, and report the full CPC title along with the best year associated with the highest exponential moving average for each CPC group at level 5.", "external_knowledge": "sliding_windows_calculation_cpc.md"}
{"instance_id": "sf_bq223", "db": "PATENTS", "question": "Which assignees, excluding DENSO CORP itself, have cited patents assigned to DENSO CORP, and what are the titles of the primary CPC subclasses associated with these citations? Provide the name of each citing assignee (excluding DENSO CORP), the full title of the primary CPC subclass (based on the first CPC code), and the count of citations grouped by the citing assignee and the CPC subclass title. Ensure that only citations of patents with valid filing dates are considered, and focus on the first CPC code for each citing patent. The results should specifically exclude DENSO CORP as a citing assignee.", "external_knowledge": "patents_info.md"}
{"instance_id": "sf_bq420", "db": "PATENTS_USPTO", "question": "Can you identify the top 5 patents that were initially rejected under section 101 with no allowed claims, based on the length of their granted claims? The patents should have been granted in the US between 2010 and 2023. Additionally, ensure to select the first office action date for each application. Please include their first publication numbers, along with their first publication dates, length of the filed claims and grant dates.", "external_knowledge": null}
{"instance_id": "sf_bq207", "db": "PATENTS_USPTO", "question": "Could you provide the earliest publication numbers, corresponding application numbers, claim numbers, and word counts for the top 100 independent patent claims, based on the highest word count, retrieved from claims stats within uspto_oce_claims (filtered by ind_flg='1'), matched with their publication numbers from uspto_oce_claims match, and further joined with patents publications to ensure only the earliest publication for each application is included, ordered by descending word count, and limited to the top 100 results?", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "sf_bq128", "db": "PATENTSVIEW", "question": "Retrieve the following information for U.S. patents filed between January 1, 2014, and February 1, 2014. The patent title and abstract. The publication date of the patent. The number of backward citations for each patent (i.e., the number of patents cited by the current patent before its filing date). The number of forward citations for each patent within the first 5 years of its publication (i.e., the number of patents that cited the current patent within 5 years after its publication). For each patent, ensure the forward citations are counted only for citations within 5 years after the publication date, and backward citations are counted for citations before the filing date.", "external_knowledge": "forward_backward_citation.md"}
{"instance_id": "sf_bq246", "db": "PATENTSVIEW", "question": "Retrieve U.S. patents with the number of forward citations within the first 3 years after the patent application date (i.e., patents citing the current patent within 3 years). Only include patents with both backward citations within 1 year before the application date and forward citations within 1 year after the application date. The query should focus on specific CPC categories, sort results by backward citations in descending order, and return the patent with the most backward citations, limiting to one result.", "external_knowledge": null}
{"instance_id": "sf_bq052", "db": "PATENTSVIEW", "question": "Retrieve the following information for U.S. patents: The patent ID, title, and application date. The number of backward citations within 1 month before the application date (i.e., patents that cited the current patent before its application). The number of forward citations within 1 month after the application date (i.e., patents that cited the current patent after its application). The abstract text of the patent. Only include patents that belong to specific CPC categories, such as subsection 'C05' or group 'A01G'. The query should filter patents to include only those that have at least one backward citation or one forward citation in the 1-month period specified. Sort the results by application date and return all matching records.", "external_knowledge": null}
{"instance_id": "sf_bq036", "db": "GITHUB_REPOS", "question": "What was the average number of GitHub commits made per month in 2016 for repositories containing Python code?", "external_knowledge": null}
{"instance_id": "sf_bq100", "db": "GITHUB_REPOS", "question": "How can we identify the top 10 most frequently used packages in GitHub repository contents by looking for import statements enclosed in parentheses, splitting any multi-line imports by newlines, extracting package names that appear within double quotes, counting how often these packages appear, ignoring any null results, and finally ordering them in descending order of their frequency? The final answer should remove the quotation marks.", "external_knowledge": null}
{"instance_id": "sf_bq101", "db": "GITHUB_REPOS", "question": "From GitHub Repos contents, how can we identify the top 10 most frequently imported package names in Java source files by splitting each file's content into lines, filtering for valid import statements, extracting only the package portion using a suitable regex, grouping by these extracted package names, counting their occurrences, and finally returning the 10 packages that appear most often in descending order of frequency?", "external_knowledge": null}
{"instance_id": "sf_bq182", "db": "GITHUB_REPOS_DATE", "question": "Which primary programming languages, determined by the highest number of bytes in each repository, had at least 100 PullRequestEvents on January 18, 2023 across all their repositories?", "external_knowledge": null}
{"instance_id": "sf_bq217", "db": "GITHUB_REPOS_DATE", "question": "On January 18, 2023, how many pull requests were created in GitHub repositories that include JavaScript as one of their programming languages, according to the data from githubarchive 20230118 and the language records in github_repos languages?", "external_knowledge": null}
{"instance_id": "sf_bq191", "db": "GITHUB_REPOS_DATE", "question": "From the 2017 GitHub WatchEvent data, find the top two repositories that have more than 300 distinct watchers, ensuring the results are joined with the 'sample_files' table so that we return each repository's name along with its distinct watcher count, and limit the output to the two repositories with the highest watcher counts.", "external_knowledge": null}
{"instance_id": "sf_bq224", "db": "GITHUB_REPOS_DATE", "question": "Which repository with an approved license in `licenses.md` had the highest combined total of forks, issues, and watches in April 2022?", "external_knowledge": null}
{"instance_id": "sf_bq192", "db": "GITHUB_REPOS", "question": "Find the most active Python repository on GitHub based on watcher count, issues, and forks. The query should select repositories with specific open-source licenses (`artistic-2.0`, `isc`, `mit`, `apache-2.0`), count distinct watchers, issue events, and forks for each repository in April 2022, and include only those with `.py` files on the `master` branch. Join the license data with watch counts, issue events, and fork counts, then sort by a combined metric of forks, issues, and watches, returning the name and count of the most active repository.", "external_knowledge": null}
{"instance_id": "sf_bq225", "db": "GITHUB_REPOS", "question": "From the GitHub repository files in 'github_repos.sample_files' joined with 'github_repos.sample_contents', which 10 programming languages occur most frequently (based on recognized file extensions) in files that have non-empty content, ordered by their file counts in descending order?", "external_knowledge": "lang_and_ext.md"}
{"instance_id": "sf_bq180", "db": "GITHUB_REPOS", "question": "Get the top 5 most frequently used module names from Python (`.py`) and R (`.r`) scripts, counting occurrences of modules in `import` and `from` statements for Python, and `library()` calls for R. The query should consider only Python and R files, group by module name, and return the top 5 modules ordered by frequency.", "external_knowledge": null}
{"instance_id": "sf_bq233", "db": "GITHUB_REPOS", "question": "Can you analyze the joined data from github repos files and github_repos contents, focusing only on files ending with '.py' or '.r', then extract Python modules from 'import' or 'from ... import' lines and R libraries from 'library(...)' lines, count their occurrences, and finally list the results sorted by language and by the number of occurrences in descending order?", "external_knowledge": null}
{"instance_id": "sf_bq248", "db": "GITHUB_REPOS", "question": "Among all repositories that do not use any programming language whose name (case-insensitively) includes the substring \"python,\" what is the proportion of files whose paths include \"readme.md\" and whose contents contain the phrase \"Copyright (c)\"?", "external_knowledge": null}
{"instance_id": "sf_bq193", "db": "GITHUB_REPOS", "question": "Retrieve all non-empty, non-commented lines from `README.md` files in GitHub repositories, excluding lines that are comments (either starting with `#` for Markdown or `//` for code comments). For each line, calculate how often each unique line appears across all repositories and return a comma-separated list of the programming languages used in each repository containing that line, sorted alphabetically, with the results ordered by the frequency of occurrence in descending order.", "external_knowledge": null}
{"instance_id": "sf_bq295", "db": "GITHUB_REPOS_DATE", "question": "Using the 2017 GitHub Archive data for watch events, which three repositories that include at least one Python file (with a .py extension) smaller than 15,000 bytes and containing the substring \"def \" in its content have the highest total number of watch events for that year?", "external_knowledge": null}
{"instance_id": "sf_bq249", "db": "GITHUB_REPOS", "question": "Please provide a report on the number of occurrences of specific line types across files from the GitHub repository. Categorize a line as 'trailing' if it ends with a blank character, as 'Space' if it starts with a space, and as 'Other' if it meets neither condition. The report should include the total number of occurrences for each category, considering all lines across all files.", "external_knowledge": null}
{"instance_id": "sf_bq375", "db": "GITHUB_REPOS", "question": "Determine which file type among Python (.py), C (.c), Jupyter Notebook (.ipynb), Java (.java), and JavaScript (.js) in the GitHub codebase has the most files with a directory depth greater than 10, and provide the file count.", "external_knowledge": null}
{"instance_id": "sf_bq255", "db": "GITHUB_REPOS", "question": "How many commit messages are there in repositories that use the 'Shell' programming language and 'apache-2.0' license, where the length of the commit message is more than 5 characters but less than 10,000 characters, and the messages do not start with the word 'merge', 'update' or 'test'?", "external_knowledge": null}
{"instance_id": "sf_bq194", "db": "GITHUB_REPOS", "question": "Among all Python (*.py), R (*.r, *.R, *.Rmd, *.rmd), and IPython notebook (*.ipynb) files in the GitHub sample dataset, which library or module is identified as the second most frequently imported or loaded based on the extracted import statements?", "external_knowledge": null}
{"instance_id": "sf_bq377", "db": "GITHUB_REPOS", "question": "Extract and count the frequency of all package names listed in the require section of JSON-formatted content", "external_knowledge": null}
{"instance_id": "sf_bq359", "db": "GITHUB_REPOS", "question": "List the repository names and commit counts for the top two GitHub repositories with JavaScript as the primary language and the highest number of commits.", "external_knowledge": null}
{"instance_id": "sf_bq252", "db": "GITHUB_REPOS", "question": "Could you please find the name of the repository that contains the most copied non-binary Swift file in the dataset, ensuring each file is uniquely identified by its ID?", "external_knowledge": null}
{"instance_id": "sf_bq251", "db": "PYPI", "question": "I want to know the GitHub project URLs for the top 3 most downloaded PyPI packages. First, extract the relevant information from PyPI package metadata, including the project URLs. Filter out only those URLs that link to GitHub. Use a regular expression to clean the GitHub URLs, removing unnecessary parts like `issues`, `pull`, `blob`, and `tree`, leaving only the main repository URL. Then, return the GitHub project URLs of the top 3 most downloaded packages, ensuring that only records with valid GitHub URLs are included. Do not put quotation marks around the final URL.", "external_knowledge": null}
{"instance_id": "bq019", "db": "cms_data", "question": "In the 2014 CMS Medicare inpatient charges data, which DRG definition has the highest total number of discharges, and among the top three cities with the most discharges for that DRG definition, what are their respective weighted average total payments (weighted by total discharges)", "external_knowledge": null}
{"instance_id": "bq234", "db": "cms_data", "question": "What is the most prescribed medication in each state in 2014?", "external_knowledge": null}
{"instance_id": "bq235", "db": "cms_data", "question": "Can you tell me which healthcare provider incurs the highest combined average costs for both outpatient and inpatient services in 2014?", "external_knowledge": null}
{"instance_id": "bq172", "db": "cms_data", "question": "For the drug with the highest total number of prescriptions in New York State during 2014, could you list the top five states with the highest total claim counts for this drug? Please also include their total claim counts and total drug costs. ", "external_knowledge": null}
{"instance_id": "bq177", "db": "cms_data", "question": "For the provider whose total inpatient Medicare cost from 2011 through 2015 is the highest (computed as the sum of average_medicare_payments multiplied by total_discharges), please list that provider’s yearly average inpatient cost and yearly average outpatient cost for each calendar year in this period, where the inpatient cost is calculated as the average of (average_medicare_payments × total_discharges) and the outpatient cost is calculated as the average of (average_total_payments × outpatient_services).", "external_knowledge": null}
{"instance_id": "bq354", "db": "cms_data", "question": "Could you provide the percentage of participants for standard acne, atopic dermatitis, psoriasis, and vitiligo as defined by the International Classification of Diseases 10-CM (ICD-10-CM), including their subcategories? Please include all related concepts mapped to the standard ICD-10-CM codes (L70 for acne, L20 for atopic dermatitis, L40 for psoriasis, and L80 for vitiligo) by utilizing concept relationships, including descendant concepts. The percentage should be calculated based on the total number of participants, considering only the standard concepts and their related descendants.", "external_knowledge": null}
{"instance_id": "bq355", "db": "cms_data", "question": "Please tell me the percentage of participants not using quinapril and related medications(Quinapril RxCUI: 35208).", "external_knowledge": null}
{"instance_id": "bq032", "db": "noaa_data", "question": "Can you provide the latitude of the final coordinates for the hurricane that traveled the second longest distance in the North Atlantic during 2020?", "external_knowledge": "functions_st_distance.md"}
{"instance_id": "bq119", "db": "noaa_data", "question": "Please show information about the hurricane with the third longest total travel distance in the North Atlantic during 2020, including its travel coordinates, the cumulative travel distance (in kilometers) at each point, and the maximum sustained wind speed at those times.", "external_knowledge": "functions_st_distance.md"}
{"instance_id": "sf_bq117", "db": "NOAA_DATA", "question": "What is the total number of severe storm events that occurred in the most affected month over the past 15 years according to NOAA records, considering only the top 100 storm events with the highest property damage?", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "bq419", "db": "noaa_data", "question": "Which 5 states had the most storm events from 1980 to 1995, considering only the top 1000 states with the highest event counts each year? Please use state abbreviations.", "external_knowledge": null}
{"instance_id": "sf_bq071", "db": "NOAA_DATA_PLUS", "question": "Can you provide the count of hurricanes and a list of hurricane names (sorted alphabetically and separated by commas) for each city and its associated zip code, where the hurricanes fall within the boundaries of the zip codes? Please exclude any unnamed hurricanes, and sort the results by the count of hurricanes in descending order. The output should include the following columns: city, zip code, state, count of hurricanes, and the list of hurricanes.", "temporal": "Yes", "external_knowledge": "functions_st_within.md"}
{"instance_id": "sf_bq236", "db": "NOAA_DATA_PLUS", "question": "What are the top 5 zip codes of the areas in the United States that have experienced the most hail storm events in the past 10 years? Don't use data from hail reports table.", "temporal": "Yes", "external_knowledge": "functions_st_within.md"}
{"instance_id": "bq356", "db": "noaa_data", "question": "Among all NOAA GSOD weather stations that recorded valid daily temperature data (non-missing temp, max, min) in 2019 and whose period of record began on or before January 1, 2000, and continued through at least June 30, 2019, how many of these stations achieved 90% or more of the maximum possible number of valid temperature-record days in 2019?", "external_knowledge": null}
{"instance_id": "bq042", "db": "noaa_data", "question": "Can you help me retrieve the average temperature, average wind speed, and precipitation for LaGuardia Airport in NYC on June 12 for each year from 2011 through 2020, specifically using the station ID 725030?", "external_knowledge": null}
{"instance_id": "bq394", "db": "noaa_data", "question": "What are the top 3 months between 2010 and 2014 with the smallest sum of absolute differences between the average air temperature, wet bulb temperature, dew point temperature, and sea surface temperature, including respective years and sum of differences? Please present the year and month in numerical format.", "external_knowledge": null}
{"instance_id": "bq357", "db": "noaa_data", "question": "What are the latitude and longitude coordinates and dates between 2005 and 2015 with the top 5 highest daily average wind speeds, excluding records with missing wind speed values? Using data from tables start with prefix \"icoads_core\".", "external_knowledge": null}
{"instance_id": "bq181", "db": "noaa_data", "question": "What percentage of weather stations recorded valid temperature data (with no missing or invalid values) for at least 90% of the days in 2022, where the temperature, maximum, and minimum values are neither NULL nor equal to 9999.9, and the station has a valid identifier (USAF code not equal to '999999'), out of all available stations in the NOAA GSOD database?", "external_knowledge": null}
{"instance_id": "bq045", "db": "noaa_data", "question": "Which weather stations in Washington State recorded more than 150 rainy days in 2023 but fewer rainy days compared to 2022? Define a “rainy day” as one having precipitation greater than zero millimeters and not equal to 99.99. Only include stations with valid precipitation data.", "external_knowledge": null}
{"instance_id": "sf_bq358", "db": "NEW_YORK_CITIBIKE_1", "question": "Can you tell me which bike trip in New York City on July 15, 2015, started and ended in ZIP Code areas with the highest average temperature for that day, as recorded by the Central Park weather station '94728'? If there's more than one trip that meets these criteria, I'd like to know about the one that starts in the smallest ZIP Code and ends in the largest ZIP Code.", "temporal": "Yes", "external_knowledge": "functions_st_within.md"}
{"instance_id": "bq290", "db": "noaa_data", "question": "Can you calculate the difference in maximum temperature, minimum temperature, and average temperature between US and UK weather stations for each day in October 2023, using the date field, and excluding records with missing or invalid temperature values?", "external_knowledge": null}
{"instance_id": "bq031", "db": "noaa_data", "question": "Provide the daily weather data for Rochester from January 1 to March 31, 2019, including temperature (in Celsius), precipitation (in centimeters), and wind speed (in meters per second). For each variable, calculate the 8-day moving average (including the current day and the previous 7 days). Also, calculate the difference between the moving average on each day and the moving averages for the previous 1 to 8 days (i.e., lag1 to lag8). The result should include: The daily values for temperature, precipitation, and wind speed. The 8-day moving averages for each variable. The differences between the moving averages for each of the previous 1 to 8 days (e.g., the difference between today's moving average and the moving average from 1 day ago, from 2 days ago, and so on). Round all values to one decimal place. The data should be ordered by date, starting from January 9, 2019.", "external_knowledge": null}
{"instance_id": "bq392", "db": "noaa_gsod", "question": "What are the top 3 dates in October 2009 with the highest average temperature for station number 723758, in the format YYYY-MM-DD?", "external_knowledge": null}
{"instance_id": "sf_bq050", "db": "NEW_YORK_CITIBIKE_1", "question": "I want to analyze bike trips in New York City for 2014 by linking trip data with weather information to understand how weather conditions (temperature, wind speed, and precipitation) affect bike trips between neighborhoods. For each combination of starting and ending neighborhoods, I need the following: 1. Total number of bike trips between the neighborhoods. 2. Average trip duration in minutes (rounded to 1 decimal). 3. Average temperature at the start of the trip (rounded to 1 decimal). 4. Average wind speed at the start (in meters per second, rounded to 1 decimal). 5. Average precipitation at the start (in centimeters, rounded to 1 decimal). 6. The month with the most trips (e.g., `4` for April). The data should be grouped by the starting and ending neighborhoods, with: `zip_codes` in `geo_us_boundaries` used to map the bike trip locations based on latitude and longitude. `zip_codes` in `cyclistic` used to obtain the borough and neighborhood names. Using weather data from the Central Park station for the trip date, covering all trips in 2014.", "external_knowledge": "functions_st_within.md"}
{"instance_id": "sf_bq426", "db": "NEW_YORK_CITIBIKE_1", "question": "What user type recorded the highest average temperature for trips starting and ending in New York City's zip code 10019 during 2018? Include average precipitation, wind speed, and temperature for that user type based on weather data from the New York Central Park station.", "external_knowledge": "functions_st_within.md"}
{"instance_id": "sf_bq291", "db": "NOAA_GLOBAL_FORECAST_SYSTEM", "question": "Can you provide a daily weather summary for July 2019 within a 5 km radius of latitude 26.75 and longitude 51.5? I need the maximum, minimum, and average temperatures; total precipitation; average cloud cover between 10 AM and 5 PM; total snowfall (when average temperature is below 32°F); and total rainfall (when average temperature is 32°F or above) for each forecast date. The data should correspond to forecasts created in July 2019 for the following day.", "external_knowledge": "functions_st_within.md"}
{"instance_id": "bq208", "db": "new_york_noaa", "question": "Can you provide weather stations within a 20-mile radius of Chappaqua, New York (Latitude: 41.197, Longitude: -73.764), and tell me the number of valid temperature observations they have recorded from 2011 to 2020, excluding any invalid or missing temperature data?", "external_knowledge": "functions_st_dwithin.md"}
{"instance_id": "bq047", "db": "new_york_noaa", "question": "Could you analyze the relationship between each complaint type and daily temperature in New York City, specifically using temperature data from LaGuardia (STN=725030) and JFK (STN=744860) airports for the 10 years starting in 2008, and then determine, for each complaint type that has more than 5000 total occurrences and shows a strong correlation (absolute value > 0.5) with temperature, the total number of complaints, the total number of days with valid temperature records, and the Pearson correlation coefficients (rounded to four decimals) between temperature and both the daily complaint count as well as the daily percentage of total complaints, excluding any days with missing or invalid temperature data (such as 9999.9)?", "external_knowledge": null}
{"instance_id": "bq048", "db": "new_york_noaa", "question": "Which complaint types with more than 3000 total requests from 2011 to 2020 show the strongest positive and negative Pearson correlations with the daily average wind speed measured at station 744860 (JFK Airport), based on daily complaint proportions (the ratio of type-specific complaints to total daily complaints)? Please provide the complaint types and their correlation coefficients, rounded to four decimal places.", "external_knowledge": null}
{"instance_id": "bq293", "db": "new_york_geo", "question": "I want to analyze yellow taxi trip data in New York City for January 2015, focusing on the number of trips by ZIP code and hour of the day. The query should (1) calculate the total number of taxi trips for each ZIP code per hour, (2) join the taxi trip data with ZIP code boundaries to ensure each trip is correctly mapped to its respective neighborhood based on latitude and longitude, and (3) compute several time-based metrics, including the number of trips from 1 hour ago (lag 1h), 1 day ago (lag 1d), 7 days ago (lag 7d), and 14 days ago (lag 14d). Additionally, the query should calculate the 14-day and 21-day moving averages and standard deviations of trip counts, excluding the current hour. The final results should include the following values: the total number of trips, the lag counts (1h, 1d, 7d, and 14d), the moving averages for 14 and 21 days and the standard deviations for 14 and 21 days. The data should be sorted by the highest trip counts, showing the top 15 groups with the most trips.", "external_knowledge": "functions_st_contains.md"}
{"instance_id": "sf_bq017", "db": "GEO_OPENSTREETMAP", "question": "What are the five longest types of highways within the multipolygon boundary of Denmark (as defined by Wikidata ID 'Q35') by total length, analyzed through planet features?", "external_knowledge": "functions_st_dwithin.md"}
{"instance_id": "sf_bq131", "db": "GEO_OPENSTREETMAP", "question": "What is the number of bus stops for the bus network with the most stops within the multipolygon boundary of Denmark (as defined by Wikidata ID 'Q35'), analyzed through planet features?", "external_knowledge": "functions_st_dwithin.md"}
{"instance_id": "sf_bq349", "db": "GEO_OPENSTREETMAP", "question": "Which OpenStreetMap ID from the planet features table corresponds to an administrative boundary, represented as multipolygons, whose total number of 'amenity'-tagged Points of Interest (POIs), as derived from the planet nodes table, is closest to the median count among all such boundaries?", "external_knowledge": "functions_st_dwithin.md"}
{"instance_id": "sf_bq007", "db": "CENSUS_BUREAU_ACS_2", "question": "Identify the top 10 U.S. states with the highest vulnerable population, calculated based on a weighted sum of employment sectors using 2017 ACS 5-Year data, and determine their average median income change from 2015 to 2018 using zip code data.", "external_knowledge": "total_vulnerable_weights.md"}
{"instance_id": "sf_bq429", "db": "CENSUS_BUREAU_ACS_2", "question": "Which are the top five states with the greatest average difference in median income between 2015 and 2018 at the ZIP code level, and what is the corresponding average number of vulnerable employees across wholesale trade, natural resources and construction, arts and entertainment, information, and retail trade industries in 2017 according to the ACS Five-Year Estimates and ZIP code boundaries data?", "external_knowledge": "avg_vulnerable_weights.md"}
{"instance_id": "sf_bq073", "db": "CENSUS_BUREAU_ACS_2", "question": "Using data on ZIP-level median income differences between 2015 and 2018, along with the 2017 ACS employment figures, list each state in descending order of total vulnerable workers, where “vulnerable” is defined as 38% of wholesale trade employees and 41% of manufacturing employees in 2017. Your results should include the state name, the number of vulnerable wholesale trade workers, the number of vulnerable manufacturing workers, and the combined total of these vulnerable workers.", "external_knowledge": null}
{"instance_id": "sf_bq410", "db": "CENSUS_BUREAU_ACS_2", "question": "Show the top 3 states with the smallest adjusted non-labor force population, calculated as the sum of the unemployed population, not in labor force population, and group quarters population (with negative results clamped to zero). For these states, provide the total median income change between 2015 and 2018, the adjusted non-labor force count from 2017 ACS data, and the population-adjusted proportion (calculated as the ratio of adjusted non-labor force to total population). Use 2017 ACS tract-level data, match tracts to states via FIPS code prefixes, and exclude tracts with null income differences. The states should be ordered by the smallest adjusted non-labor force count, and results should include state abbreviations with aggregated values.", "external_knowledge": null}
{"instance_id": "sf_bq348", "db": "GEO_OPENSTREETMAP", "question": "Within the rectangular area defined by the geogpoints (31.1798246, 18.4519921), (54.3798246, 18.4519921), (54.3798246, 33.6519921), and (31.1798246, 33.6519921), which are the top three usernames responsible for the highest number of historical nodes, originally tagged with the amenities ‘hospital’, ‘clinic’, or ‘doctors’, that do not appear anymore in the current planet_nodes dataset?", "external_knowledge": "functions_st_intersects_polygon_line.md"}
{"instance_id": "sf_bq253", "db": "GEO_OPENSTREETMAP", "question": "Find the name of the OpenStreetMap relation that encompasses the most features within the same geographic area as the multipolygon tagged with the Wikidata item 'Q1095'. The relation should have a specified name and no 'wikidata' tag, and at least one of its included features must have a 'wikidata' tag. The analysis should be conducted using the planet_features table. Return the name of this relation.", "external_knowledge": "functions_st_dwithin.md"}
{"instance_id": "sf_bq254", "db": "GEO_OPENSTREETMAP", "question": "Among all multipolygons located within the same geographic area as the multipolygon associated with Wikidata item Q191, but lacking a 'wikidata' tag themselves, which two rank highest by the number of points that lie within their boundaries, and what are their names?", "external_knowledge": "functions_st_dwithin.md"}
{"instance_id": "sf_bq056", "db": "GEO_OPENSTREETMAP_BOUNDARIES", "question": "How many different pairs of roads classified as motorway, trunk, primary, secondary, or residential in California overlap each other without sharing nodes and do not have a bridge tag, where these roads are tagged with 'highway', analyzed through planet ways?", "external_knowledge": "functions_st_intersects.md"}
{"instance_id": "sf_bq289", "db": "GEO_OPENSTREETMAP_CENSUS_PLACES", "question": "Can you find the shortest distance between any two amenities (either a library, place of worship, or community center) located within Philadelphia, analyzed through pennsylvania table and planet features points?", "external_knowledge": "functions_st_contains.md"}
{"instance_id": "sf_bq250", "db": "GEO_OPENSTREETMAP_WORLDPOP", "question": "Based on the most recent 1km population grid data in Singapore before January 2023, using ST_CONVEXHULL to aggregate all population grid centroids into a bounding region and ST_INTERSECTS to identify hospitals from OpenStreetMap’s planet layer (layer_code in (2110, 2120)) that fall within this region, then calculating the distance from each grid cell to its nearest hospital, what is the total population of the grid cell that is farthest from any hospital?", "external_knowledge": "OpenStreetMap_data_in_layered_GIS_format.md"}
{"instance_id": "sf_bq083", "db": "CRYPTO", "question": "Can you calculate the daily change in the market value of USDC tokens (address `0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48`) for 2023, based on Ethereum transactions? The change should be computed from minting (input pattern `0x40c10f19%`) and burning (input pattern `0x42966c68%`) operations. For each transaction, minting should be positive and burning negative. Extract the relevant amount from the 'input' field as a hexadecimal, convert it to millions, express it in USD format. Group the results by date and order them in descending order.", "external_knowledge": "Total_Market_Value_Change.md"}
{"instance_id": "sf_bq184", "db": "CRYPTO", "question": "Using only the traces, can you calculate daily cumulative counts of smart contracts created by external addresses (where the trace_address is NULL) versus those created by other contracts (where the trace_address is NOT NULL) for each date from 2017-01-01 through 2021-12-31, ensuring that all dates in this range are included even if no new contracts were created on some days, and showing monotonically increasing cumulative totals for both categories?", "external_knowledge": null}
{"instance_id": "sf_bq195", "db": "CRYPTO", "question": "What are the top 10 Ethereum addresses by balance, considering both value transactions and gas fees, before September 1, 2021? Only keep successful transactions with no call type or where the call type is 'call'.", "external_knowledge": null}
{"instance_id": "sf_bq256", "db": "CRYPTO", "question": "Determine the final Ether balance of the Ethereum address that initiated the highest number of successful transactions prior to September 1, 2021 (UTC), excluding calls of type delegatecall, callcode, or staticcall and including all relevant incoming and outgoing transfers, miner rewards, and gas fee deductions, with the final balance presented in Ether after converting from the native unit.", "external_knowledge": null}
{"instance_id": "sf_bq080", "db": "CRYPTO", "question": "Using only the Ethereum traces table, can you provide a daily cumulative count of smart contracts created by external users (where trace_address is null) versus contracts created by other contracts (where trace_address is not null) between August 30, 2018, and September 30, 2018? Ensure results include every date in this range, even if no new contracts were created, and show strictly increasing cumulative totals.", "external_knowledge": null}
{"instance_id": "sf_bq342", "db": "CRYPTO", "question": "What is the difference between the average hourly changes in transaction values for the Ethereum token 0x68e54af74b22acaccffa04ccaad13be16ed14eac, specifically considering only transactions where the address 0x8babf0ba311aab914c00e8fda7e8558a8b66de5d was the sender or the address 0xfbd6c6b112214d949dcdfb1217153bc0a742862f was the receiver, between January 1, 2019, and December 31, 2020, when comparing 2019 to 2020?", "external_knowledge": null}
{"instance_id": "sf_bq341", "db": "CRYPTO", "question": "Which three Ethereum addresses have the smallest positive balances from transactions involving the token at address \"0xa92a861fc11b99b24296af880011b47f9cafb5ab\"?", "external_knowledge": null}
{"instance_id": "sf_bq444", "db": "CRYPTO", "question": "Can you pull the blockchain timestamp, block number, and transaction hash for the first five mint and burn events from Ethereum logs for the address '0x8ad599c3a0ff1de082011efddc58f1908eb6e6d8'? Please include mint events identified by the topic '0x7a53080ba414158be7ec69b987b5fb7d07dee101fe85488f0853ae16239d0bde' and burn events by '0x0c396cd989a39f4459b5fa1aed6a9a8dcdbc45908acfd67e028cd568da98982c', and order them by block timestamp from the oldest to the newest.", "external_knowledge": "ethereum_logs_and_events_overview.md"}
{"instance_id": "sf_bq340", "db": "CRYPTO", "question": "Which six Ethereum addresses, excluding '0x0000000000000000000000000000000000000000', have the largest absolute differences between their previous and current balances from the tokens at addresses '0x0d8775f648430679a709e98d2b0cb6250d2887ef0' and '0x1e15c05cbad367f044cbfbafda3d9a1510db5513'?", "external_knowledge": null}
{"instance_id": "sf_bq005", "db": "CRYPTO", "question": "Calculate the daily average Bitcoin block interval (in seconds) for 2023 by joining consecutive blocks via row-numbered self-joins (including cross-day intervals), excluding the genesis block, and list the first 10 dates with their unadjusted averages.", "external_knowledge": null}
{"instance_id": "sf_bq334", "db": "CRYPTO", "question": "Calculate the annual differences in Bitcoin output value averages between two methods: Merged input/output records: Combine the inputs and outputs tables, filter to only output records, and calculate yearly averages. Transactions table: Directly use the output_value field from the transactions table for yearly averages. Show the difference (merged outputs average minus transactions average) only for years with data in both methods.", "external_knowledge": null}
{"instance_id": "sf_bq335", "db": "CRYPTO", "question": "Among all Bitcoin addresses that have at least one transaction in October 2017 (combining both inputs and outputs), which address conducted its final transaction on the latest date in that month, and, among any addresses sharing that same latest date, which one has the highest sum of transaction values?", "external_knowledge": null}
{"instance_id": "sf_bq057", "db": "CRYPTO", "question": "Which month (e.g., 3) in 2021 witnessed the highest percent of Bitcoin volume that took place in CoinJoin transactions? Also give me the percentage of CoinJoins transactions, the average input and output UTXOs ratio, and the proportion of CoinJoin transaction volume for that month (all 1 decimal).", "external_knowledge": null}
{"instance_id": "sf_bq068", "db": "CRYPTO", "question": "Using double-entry bookkeeping principles by treating transaction inputs as debits (negative values) and outputs as credits (positive values) for all Bitcoin Cash transactions between 2014-03-01 and 2014-04-01, how can we calculate the maximum and minimum final balances grouped by address type from these transactions?", "external_knowledge": null}
{"instance_id": "sf_bq092", "db": "CRYPTO", "question": "In April 2023, considering Dash transaction data tracked using double-entry bookkeeping to separately account for debits and credits from each address, what are the highest and lowest resulting balances across all addresses when filtering records by their block timestamps for that month?", "external_knowledge": null}
{"instance_id": "sf_bq093", "db": "CRYPTO", "question": "Tell me the maximum and minimum net changes in balances for Ethereum Classic addresses on October 14, 2016, calculated by summing debits (values sent to addresses, excluding internal calls), credits (values sent from addresses, excluding internal calls), and gas fees (total gas used multiplied by the gas price for both miners and senders), while only considering successful transactions.", "external_knowledge": null}
{"instance_id": "sf_bq292", "db": "CRYPTO", "question": "Analyze Bitcoin transactions since July 2023 to determine monthly percentages of: (1) Transactions classified as CoinJoins (defined by >2 outputs, output value ≤ input value, and multiple identical-value outputs), (2) UTXOs involved in CoinJoins (calculated as the average of CoinJoin input/output ratios against total network UTXOs), (3) Transaction volume (based on input value) attributed to CoinJoins. Provide results in a table with monthly metrics for transactions, UTXOs, and volume.", "external_knowledge": null}
{"instance_id": "sf_bq135", "db": "CRYPTO", "question": "Which date before 2022 had the highest total transaction amount in the Zilliqa blockchain data?", "external_knowledge": null}
{"instance_id": "sf_bq136", "db": "CRYPTO", "question": "Find all 2-hop transaction paths on Zilliqa between addresses `zil1jrpjd8pjuv50cfkfr7eu6yrm3rn5u8rulqhqpz` (source) and `zil19nmxkh020jnequql9kvqkf3pkwm0j0spqtd26e` (destination) since the first transaction date. Exclude paths where intermediate addresses have >50 outgoing transactions (to filter exchanges/high-activity wallets). Ensure paths: (1) Follow chronological order (earlier transaction timestamps first), (2) Use the format: `<from> --(tx ABCDE..)--> <intermediate> --(tx FGHIJ..)--> <to>`, (3) Only show paths where both transactions are on-chain confirmed.", "external_knowledge": null}
{"instance_id": "sf_bq065", "db": "CRYPTO", "question": "Retrieve the 10 most recent oracle requests with script ID 3, and for each request: Extract all symbol-rate pairs from the decoded result, adjust each rate by dividing it by the request’s multiplier, return the block timestamp, request ID, symbol, and adjusted rate. Prioritize chronological order (newest first).", "external_knowledge": null}
{"instance_id": "sf_bq037", "db": "HUMAN_GENOME_VARIANTS", "question": "About the refined human genetic variations collected in phase 3 on 2015-02-20, I want to know the minimum and maximum start positions as well as the proportions of these two respectively for reference bases 'AT' and 'TA'.", "external_knowledge": null}
{"instance_id": "sf_bq012", "db": "ETHEREUM_BLOCKCHAIN", "question": "Calculate the average balance (in quadrillions, 10^15) of the top 10 Ethereum addresses by net balance, including incoming and outgoing transfers from traces (only successful transactions and excluding call types like delegatecall, callcode, and staticcall), miner rewards (sum of gas fees per block), and sender gas fee deductions. Exclude null addresses and round the result to two decimal places.", "external_knowledge": null}
{"instance_id": "sf_bq187", "db": "ETHEREUM_BLOCKCHAIN", "question": "Calculate the total circulating supply of 'BNB' tokens (in units divided by 10^18) by summing balances of all non-zero addresses, where each address’s balance equals its total received BNB minus sent BNB. Exclude transactions involving the zero address (0x000...) for both senders and receivers.", "external_knowledge": null}
{"instance_id": "sf_bq450", "db": "ETHEREUM_BLOCKCHAIN", "question": "Generate a comprehensive report of all Ethereum addresses active before January 1, 2017, calculating their net balances (adjusted for transaction fees and excluding delegatecall/callcode/staticcall transactions), hourly activity patterns, active days, incoming/outgoing transaction metrics (counts, unique counterparties, average ETH transfers), ERC20 token interactions (in/out counts, unique tokens, counterparties), mining rewards, contract creation frequency, failed transaction counts, and contract bytecode sizes, with all ETH values converted to standard units (divided by 10^18) and excluding addresses with no transaction history.", "external_knowledge": "ethereum_data_transformation.md"}
{"instance_id": "bq034", "db": "ghcn_d", "question": "I want to know the IDs and names of weather stations within a 50 km straight-line distance from the center of Chicago (41.8319°N, 87.6847°W).", "external_knowledge": null}
{"instance_id": "bq383", "db": "ghcn_d", "question": "Could you provide the highest recorded precipitation, minimum temperature, and maximum temperature from the last 15 days of each year from 2013 to 2016 at weather station USW00094846? Ensure each value represents the peak measurement for that period, with precipitation in millimeters and temperatures in degrees Celsius, using only validated data (non-null values and no quality flags).", "external_knowledge": null}
{"instance_id": "bq051", "db": "new_york_ghcn", "question": "Calculate the average daily number of Citibike trips in New York City during 2016, categorizing days as rainy if the total precipitation exceeds 5 millimeters (obtained by dividing the raw precipitation value by 10), and non-rainy otherwise. Use data from the nearest GHCN station located within 50 km of (40.7128, -74.0060) that has valid, unflagged measurements, then compare the resulting average Citibike trips on rainy days versus non-rainy days.", "external_knowledge": null}
{"instance_id": "bq038", "db": "new_york", "question": "Identify the top 10 Citibike stations by highest proportion of group rides, defined as trips starting and ending at the same station where multiple riders departed/arrived within the same 2-minute time window.", "external_knowledge": null}
{"instance_id": "bq053", "db": "new_york", "question": "Calculate the change in the number of living trees of each fall color in New York City from 1995 to 2015 by computing, for each tree species, the difference between the number of trees not marked as dead in 1995 and the number of trees alive in 2015, matching species by the uppercase form of their scientific names from the tree_species table. Then, group the species by their fall color and sum these differences to determine the total change in the number of trees for each fall color.", "external_knowledge": null}
{"instance_id": "bq054", "db": "new_york", "question": "Please provide the top 10 tree species in New York, using their uppercase Latin names where the Latin name is not empty and including their common names, showing the total number of trees, the counts of alive and dead trees for each year, and the corresponding growth in these counts from 1995 to 2015, then order by the difference in total tree counts between these years.", "external_knowledge": null}
{"instance_id": "bq021", "db": "new_york", "question": "For the top 20 Citi Bike routes in 2016, which route is faster than yellow taxis and among those, which one has the longest average bike duration? Please provide the start station name of this route. The coordinates are rounded to three decimals.", "external_knowledge": null}
{"instance_id": "bq202", "db": "new_york", "question": "For the station that had the highest number of Citibike trips starting there in 2018, which numeric day of the week and which hour of the day had the greatest number of trips based on the start time of those trips?", "external_knowledge": null}
{"instance_id": "bq185", "db": "new_york_plus", "question": "What is the average trip duration in minutes for all valid Yellow taxi trips that took place between February 1, 2016, and February 7, 2016 (inclusive), with a positive trip duration, more than three passengers, and a trip distance of at least ten miles, where both the pickup and dropoff locations are in Brooklyn?", "external_knowledge": null}
{"instance_id": "bq040", "db": "new_york_plus", "question": "For NYC yellow taxi trips between January 1 and January 7, 2016, excluding any trips picked up in ‘EWR’ or ‘Staten Island,’ determine the proportion of rides that fall into each tip category in each pickup borough. Only include trips where the dropoff time is after the pickup time, the passenger count is greater than zero, and trip_distance, tip_amount, tolls_amount, mta_tax, fare_amount, and total_amount are all non-negative. Classify the tip percentage as follows: 0% (no tip), up to 5%, 5% to 10%, 10% to 15%, 15% to 20%, 20% to 25%, and more than 25%.", "external_knowledge": "taxi_tip_rate.md"}
{"instance_id": "bq098", "db": "new_york_plus", "question": "For NYC yellow taxi trips where both the pickup and dropoff occurred between January 1 and 7, 2016, inclusive, calculate the percentage of trips with no tip in each pickup borough, ensuring that only trips where the dropoff occurs after the pickup are included, the passenger count is greater than zero, and the trip distance, tip amount, tolls amount, MTA tax, fare amount, and total amount are non-negative; define \"no tip\" trips as those where the tip rate is zero, with the tip rate calculated as (tip_amount × 100) divided by total_amount (and considered zero when total_amount is zero).", "external_knowledge": "taxi_tip_rate.md"}
{"instance_id": "bq039", "db": "new_york_plus", "question": "Find the top 10 taxi trips in New York City between July 1 and July 7, 2016 (ensuring both pickup and dropoff times fall within these dates) where the passenger count is greater than five, the trip distance is at least ten miles, and there are no negative fare-related amounts (including tip, tolls, mta tax, fare, and total costs). Exclude any trips where the dropoff time is not strictly after the pickup time, then sort the results by total fare amount in descending order. Finally, display each trip’s pickup zone, dropoff zone, trip duration in seconds, driving speed in miles per hour, and tip rate as a percentage of the total fare amount.", "external_knowledge": null}
{"instance_id": "bq203", "db": "new_york_plus", "question": "For each New York City borough, how many subway stations are there in total, how many have at least one entrance that is marked both as an actual entry and as ADA-compliant, and what percentage of the total stations in each borough does this represent, listing boroughs from the highest to the lowest percentage?", "external_knowledge": null}
{"instance_id": "bq035", "db": "san_francisco", "question": "What is the total distance traveled by each bike in the San Francisco Bikeshare program? Use data from bikeshare trips and stations to calculate this.", "external_knowledge": null}
{"instance_id": "bq186", "db": "san_francisco", "question": "Please find, for each year-month combination (in the format YYYYMM) derived from the start date of bike share trips in San Francisco, the first trip duration in minutes, the last trip duration in minutes, the highest trip duration in minutes, and the lowest trip duration in minutes, where ‘first’ and ‘last’ are determined by the chronological order of the trip start date, then group your results by this year-month and sort them by the same year-month key.", "external_knowledge": null}
{"instance_id": "bq081", "db": "san_francisco_plus", "question": "Find the latest ride data for each region between 2014 and 2017. I want to know the name of each region, the trip ID of this ride, the ride duration, the start time, the starting station, and the gender of the rider.", "external_knowledge": null}
{"instance_id": "sf_bq294", "db": "SAN_FRANCISCO_PLUS", "question": "Could you provide the details of the top 5 longest bike share trips that started between July 1, 2017, and December 31, 2017, including the trip ID, duration in seconds, start date, start station name, route (derived from start station name to end station name), bike number, subscriber type, member's birth year, the member's current age (calculated using the current year), an age classification based on whether the member is younger than 40, between 40 and 60, or older than 60, the member's gender, and the name of the region of the start station? Please exclude any trips where the start station name, member's birth year, or member's gender is not specified.", "temporal": "Yes", "external_knowledge": "trip_info.md"}
{"instance_id": "bq339", "db": "san_francisco_plus", "question": "Which month in 2017 had the largest absolute difference between cumulative bike usage minutes for customers and subscribers?", "external_knowledge": null}
{"instance_id": "bq400", "db": "san_francisco_plus", "question": "For trips where 'Clay St & Drumm St' occurs before 'Sacramento St & Davis St' in the stop sequence (one direction only), what are the earliest departure times from 'Clay St & Drumm St' and the latest arrival times at 'Sacramento St & Davis St' in the format HH:MM:SS? Please provide the trip headsign for each route.", "external_knowledge": null}
{"instance_id": "bq059", "db": "san_francisco_plus", "question": "What is the highest average speed (rounded to 1 decimal, in metric m/s) for bike trips in Berkeley with trip distance greater than 1000 meters?", "external_knowledge": null}
{"instance_id": "bq376", "db": "san_francisco_plus", "question": "For each neighborhood in San Francisco where at least one bike share station and at least one crime incident are located, provide the neighborhood name along with the total count of bike share stations and the total number of crime incidents in that neighborhood.", "external_knowledge": null}
{"instance_id": "sf_bq014", "db": "THELOOK_ECOMMERCE", "question": "Can you help me figure out the revenue for the product category that has the highest number of customers making a purchase in their first non-cancelled and non-returned order?", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "sf_bq188", "db": "THELOOK_ECOMMERCE", "question": "Among all product categories in the dataset, which category has the highest total quantity purchased, and for that category, what is the average time in minutes that users spend on each product page visit, calculated as the average difference between the time the product page is viewed and the time of the next event within the same session", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "sf_bq258", "db": "THELOOK_ECOMMERCE", "question": "Generate a monthly report for each product category, where each row corresponds to orders that have a status of 'Complete' and were delivered before the year 2022, grouping by the month and year of delivery. For each category, calculate the total revenue (the sum of sale_price) and the total number of completed orders, and compute the month-over-month percentage growth for both revenue and orders by comparing each month’s totals to the previous month’s. Then, for the same orders, aggregate and show the total cost (from product costs), total profit (revenue minus total cost), and finally the profit-to-cost ratio for each month.", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "sf_bq259", "db": "THELOOK_ECOMMERCE", "question": "Using data up to the end of 2022 and organized by the month of each user's first purchase, can you provide the percentage of users who made a purchase in each of the first, second, third, and fourth months since their initial purchase, where the \"first month\" refers to the month of their initial purchase?", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "sf_bq189", "db": "THELOOK_ECOMMERCE", "question": "Based solely on completed orders, calculate the average monthly percentage growth rate in the number of unique orders (counting distinct order IDs) for each product category by comparing each month's count to the previous month within the same category. Identify the product category with the highest average of these monthly order growth rates. Then, for that specific product category, compute the average monthly revenue growth rate by calculating the percentage change in total revenue (sum of sale prices) from month to month and averaging these values over the entire period.", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "sf_bq260", "db": "THELOOK_ECOMMERCE", "question": "From January 1, 2019, to April 30, 2022, how many users are at the youngest age and how many users are at the oldest age for each gender in the e-commerce platform, counting both youngest and oldest users separately for each gender?", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "sf_bq261", "db": "THELOOK_ECOMMERCE", "question": "For each month prior to January 2024, identify the product that achieved the highest total profit (calculated as the sum of sale_price minus the product’s cost) across all order items, then report the total cost and total profit for that top product per month, including all order items regardless of their status, and present the results chronologically by month.", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "sf_bq262", "db": "THELOOK_ECOMMERCE", "question": "Generate a monthly analysis report for e-commerce sales from June 2019 to December 2019 that includes, for each product category and each month, the total number of orders, total revenue, and total profit, along with their month-over-month growth rates using the data from June 2019 as the basis for calculating growth starting from July 2019. Ensure that all orders are included regardless of their status, and present the results sorted in ascending order by month (formatted as \"2019-07\") and then by product category. Omit June 2019 from the final output, but use it for the growth calculations.", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "sf_bq190", "db": "THELOOK_ECOMMERCE", "question": "Determine the number of users who are the youngest and oldest for each gender (male and female) separately, among those who signed up between January 1, 2019, and April 30, 2022. For each gender, identify the minimum and maximum ages within this date range, and count how many users fall into these respective age groups.", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "sf_bq263", "db": "THELOOK_ECOMMERCE", "question": "Please create a month-by-month report for the year 2023 that focuses on the 'Sleep & Lounge' category, showing for each month the total sales, total cost, number of complete orders, total profit, and the profit-to-cost ratio, ensuring that the order is marked as 'Complete,' the creation date is between January 1, 2023, and December 31, 2023, and the cost data is accurately associated with the corresponding product through the order items.", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "sf_bq264", "db": "THELOOK_ECOMMERCE", "question": "Identify the difference in the number of the oldest and youngest users registered between January 1, 2019, and April 30, 2022, from our e-commerce platform data.", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "sf_bq197", "db": "THELOOK_ECOMMERCE", "question": "Considering only orders with a status of 'Complete' and products with a non-null brand, identify for each month prior to July 2024 the product that achieved the highest sales volume and revenue, including in the results the product's name, brand, category, total sales, total revenue, month, and order status.", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "sf_bq265", "db": "THELOOK_ECOMMERCE", "question": "Can you list the email addresses of the top 10 users who registered in 2019 and made purchases in 2019, ranking them by their highest average order value, where average order value is calculated by multiplying the number of items in each order by the sale price, summing this total across all orders for each user, and then dividing by the total number of orders?", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "sf_bq266", "db": "THELOOK_ECOMMERCE", "question": "Please provide the names of the products that had sales in each month of 2020 and had the lowest profit, calculated as the difference between their retail price and cost from the products data. Exclude any months where this data isn't available. Please list the products in chronological order based on the month.", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "sf_bq333", "db": "THELOOK_ECOMMERCE", "question": "Which three browsers have the shortest average session duration—calculated by the difference in seconds between the earliest and latest timestamps for each user’s session—while only including browsers that have more than 10 total sessions, and what are their respective average session durations?", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "sf_bq361", "db": "THELOOK_ECOMMERCE", "question": "For the user cohort with a first purchase date in January 2020, what proportion of users returned in the subsequent months of 2020?", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "sf_bq271", "db": "THELOOK_ECOMMERCE", "question": "Please generate a report that, for each month in 2021, provides the number of orders, the number of unique purchasers, and the profit (calculated as the sum of product retail prices minus the sum of product costs), where the orders were placed during 2021 by users who registered in 2021 for inventory items created in 2021, and group the results by the users' country, product department, and product category.", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "sf_bq272", "db": "THELOOK_ECOMMERCE", "question": "Please provide the names of the top three most profitable products for each month from January 2019 through August 2022, excluding any products associated with orders that were canceled or returned. For each product in each month, the profit should be calculated as the sum of the sale prices of all order items minus the sum of the costs of those sold items in that month.", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "sf_bq273", "db": "THELOOK_ECOMMERCE", "question": "Can you list the top 5 months from August 2022 to November 2023 where the profit from Facebook-sourced completed orders showed the largest month-over-month increase? Calculate profit as sales minus costs.", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "sf_bq020", "db": "GENOMICS_CANNABIS", "question": "What is the name of the reference sequence with the highest variant density in the given cannabis genome dataset?", "external_knowledge": null}
{"instance_id": "sf_bq107", "db": "GENOMICS_CANNABIS", "question": "What is the variant density of the cannabis reference with the longest reference length? Pay attention that a variant is present if there is at least one variant call with a genotype greater than 0.", "external_knowledge": null}
{"instance_id": "bq025", "db": "census_bureau_international", "question": "Provide a list of the top 10 countries for the year 2020, ordered by the highest percentage of their population under 20 years old. For each country, include the total population under 20 years old, the total midyear population, and the percentage of the population that is under 20 years old.", "external_knowledge": null}
{"instance_id": "bq115", "db": "census_bureau_international", "question": "Which country has the highest percentage of population under the age of 25 in 2017?", "external_knowledge": null}
{"instance_id": "bq030", "db": "covid19_open_data", "question": "As of May 10, 2020, among all countries that had more than 50,000 confirmed COVID-19 cases, which three countries had the highest recovery rates based on the total number of recovered cases relative to their total confirmed cases, and what were their respective recovery rates expressed as percentages?", "external_knowledge": null}
{"instance_id": "bq018", "db": "covid19_open_data", "question": "Which day in March and April had the highest COVID-19 confirmed case growth rate in the United States? The format is MM-DD.", "external_knowledge": null}
{"instance_id": "bq086", "db": "covid19_open_world_bank", "question": "You need to calculate the percentage of each country's population that had been confirmed with COVID-19 by June 30, 2020. The population data for 2018 can be found in the World Bank dataset, and the cumulative COVID-19 confirmed cases data is available in the COVID-19 Open Data dataset. Calculate the percentage of each country's population that was cumulatively confirmed to have COVID-19.", "external_knowledge": null}
{"instance_id": "bq085", "db": "covid19_jhu_world_bank", "question": "Could you provide, for the United States, France, China, Italy, Spain, Germany, and Iran, the total number of confirmed COVID-19 cases as of April 20, 2020, along with the number of cases per 100,000 people based on their total 2020 populations calculated by summing all relevant population entries from the World Bank data", "external_knowledge": null}
{"instance_id": "bq130", "db": "covid19_nyt", "question": "Analyze daily new COVID-19 case counts from March to May 2020, identifying the top five states by daily increases. Please compile a ranking based on how often each state appears in these daily top fives. Then, examine the state that ranks fourth overall and identify its top five counties based on their frequency of appearing in the daily top five new case counts.", "external_knowledge": null}
{"instance_id": "bq087", "db": "covid19_symptom_search", "question": "Please calculate the overall percentage change in the average weekly search frequency for the symptom 'Anosmia' across the five New York City counties—Bronx County, Queens County, Kings County, New York County, and Richmond County—by comparing the combined data from January 1, 2019, through December 31, 2019, with the combined data from January 1, 2020, through December 31, 2020.", "external_knowledge": null}
{"instance_id": "bq088", "db": "covid19_symptom_search", "question": "Please calculate the average levels of anxiety and depression symptoms from the weekly country data for the United States during the periods from January 1, 2019, to January 1, 2020, and from January 1, 2020, to January 1, 2021. Then, compute the percentage increase in these average symptom levels from the 2019 period to the 2020 period.", "external_knowledge": null}
{"instance_id": "bq089", "db": "covid19_usa", "question": "Given the latest population estimates from the 2018 five-year American Community Survey, what is the number of vaccine sites per 1000 people for counties in California?", "external_knowledge": null}
{"instance_id": "bq407", "db": "covid19_usa", "question": "Find the top three counties with populations over 50,000, using the 2020 5-year census data, that had the highest COVID-19 case fatality rates on August 27, 2020. For these counties, provide the name, state, median age, total population, number of confirmed COVID-19 cases per 100,000 people, number of deaths per 100,000 people, and the case fatality rate as a percentage", "external_knowledge": null}
{"instance_id": "bq137", "db": "census_bureau_usa", "question": "Please find all zip code areas located within 10 kilometers of the coordinates (-122.3321, 47.6062) by joining the 2010 census population data (summing only male and female populations with no age constraints) and the zip code area information, and return each area’s polygon, land and water area in meters, latitude and longitude, state code, state name, city, county, and total population.", "external_knowledge": "functions_st_dwithin.md"}
{"instance_id": "bq060", "db": "census_bureau_international", "question": "Which top 3 countries had the highest net migration in 2017 among those with an area greater than 500 square kilometers? And what are their migration rates?", "external_knowledge": null}
{"instance_id": "bq338", "db": "census_bureau_acs_1", "question": "Can you find the census tracts in the 36047 area that are among the top 20 for the largest percentage increases in population from 2011 to 2018, are also among the top 20 for the largest absolute increases in median income during the same period, and had over 1,000 residents in each of those years?", "external_knowledge": null}
{"instance_id": "bq061", "db": "census_bureau_acs_1", "question": "Which census tract has witnessed the largest increase in median income between 2015 and 2018 in California? Tell me the tract code.", "external_knowledge": null}
{"instance_id": "bq064", "db": "census_bureau_acs_1", "question": "Using the 2017 U.S. Census Tract data from the BigQuery public datasets, you need to proportionally allocate each tract's population and income to the zip codes based on the overlapping area between their geographic boundaries. Then, filter the results to include only those zip codes located within a 5-mile radius of a specific point in Washington State, with coordinates at latitude 47.685833°N and longitude -122.191667°W. Finally, calculate the total population and the average individual income for each zip code (rounded to one decimal place) and sort the results by the average individual income in descending order.", "external_knowledge": "functions_st_intersects_area.md"}
{"instance_id": "bq461", "db": "ncaa_basketball", "question": "Please provide a chronological summary of all scoring plays from the 2014 season game where the Wildcats were the home team and the Fighting Irish were the away team. Include for each scoring event the game clock, cumulative scores for both teams (Wildcats and Fighting Irish), the team that scored, and a description of the event.", "external_knowledge": null}
{"instance_id": "bq198", "db": "ncaa_basketball", "question": "List the top 5 universities with the most seasons where they achieved the maximum wins in their respective NCAA basketball seasons between 1900-2000, showing each team's total number of such peak-performance seasons, while excluding entries with missing team names.", "external_knowledge": null}
{"instance_id": "bq462", "db": "ncaa_basketball", "question": "Please generate a table that lists the top five NCAA basketball records in each of the following four categories: (1) the largest venues by seating capacity (include all venues with Date as 'N/A'), (2) the National Championship games since the 2016 season with the biggest point margin victories, (3) the games since the 2011 season with the highest total points scored by both teams combined, and (4) the games since the 2011 season with the highest total three-pointers made by both teams combined. Organize the results into a table with columns for Category, Date, Matchup or Venue, and Key Metric.", "external_knowledge": null}
{"instance_id": "bq427", "db": "ncaa_basketball", "question": "Could you determine, for each shot type, the average x and y coordinates (adjusted to ensure consistency regarding the left or right basket), the average number of shot attempts, and the average number of successful shots, considering only shots taken before March 15, 2018, excluding those with null shot types or coordinates, ensuring the shots are on the correct side of the court based on the team's basket.", "external_knowledge": "basketball.md"}
{"instance_id": "bq428", "db": "ncaa_basketball", "question": "For the top five team markets with the highest number of distinct players who scored at least 15 points during the second period of games between 2010 and 2018, provide details of each game they played in NCAA basketball historical tournament matches during the same period, as specified in the data model document.", "external_knowledge": "ncaa_data_model.md"}
{"instance_id": "bq144", "db": "ncaa_insights", "question": "Create a dataset by combining NCAA men's basketball tournament game outcomes from the 2014 season onwards, including both the historical tournament games and the 2018 tournament results, with the corresponding pace and efficiency performance metrics for each team and their opponents from the feature_engineering data. The dataset should include the season, game outcome labels (win or loss), team and opponent seeds, school names, pace and efficiency rankings, statistical values, and the differences between the team's and the opponent's metrics to enable a comprehensive analysis of team and opponent dynamics.", "external_knowledge": "NCAA_Basketball_Tournament_SQL_Query_Variable_Guide.md"}
{"instance_id": "bq113", "db": "bls", "question": "Which county in Utah experienced the greatest percentage increase in construction employment from 2000 to 2018, calculated by averaging the employment levels during the third month of each quarter in those years? What is the corresponding percentage increase?", "external_knowledge": null}
{"instance_id": "bq112", "db": "bls", "question": "Between 1998 and 2017, for Allegheny County in the Pittsburgh area, did the average annual wages for all industries keep pace with the inflation of all consumer items, and what were the respective percentage growth rates (to two decimal places) for wages and the CPI over that period?", "external_knowledge": null}
{"instance_id": "bq055", "db": "google_dei", "question": "Can you provide the top three races with the largest percentage differences between Google's 2021 overall hiring data from dar non intersectional hiring and the average percentages in the 2021 BLS data for the technology sectors specifically defined as 'Internet publishing and broadcasting and web search portals,' 'Software publishers,' 'Data processing, hosting, and related services,' or the industry group 'Computer systems design and related services,' along with their respective differences?", "external_knowledge": null}
{"instance_id": "bq075", "db": "google_dei", "question": "Could you provide a combined 2021 report comparing racial (Asian, Black, Hispanic/Latinx, White) and gender (U.S. Women, U.S. Men) distributions across Google’s overall workforce hiring, Google’s overall workforce representation, and the BLS data specifically for the technology sectors defined as Internet publishing and broadcasting and web search portals or Computer systems design and related services?", "external_knowledge": null}
{"instance_id": "bq406", "db": "google_dei", "question": "Please calculate the growth rates for Asians, Black people, Latinx people, Native Americans, White people, US women, US men, global women, and global men from 2014 to 2024 concerning the overall workforce.", "external_knowledge": null}
{"instance_id": "sf_bq084", "db": "GOOG_BLOCKCHAIN", "question": "For each month in the year 2023, how many total transactions occurred (counting all transaction records without removing duplicates of transaction hashes), and how many transactions per second were processed each month, where the transactions-per-second value is calculated by dividing the monthly total count by the exact number of seconds in that month, including the correct leap-year logic if applicable based on the extracted year from the transaction timestamp? Show the monthly transaction count, the computed transactions per second, the year, and the month, and present the rows in descending order of the monthly transaction count.", "external_knowledge": null}
{"instance_id": "sf_bq058", "db": "GOOG_BLOCKCHAIN", "question": "Retrieve all finalized deposits into Optimism at block 29815485 using the Optimism Standard Bridge, including transaction hash, an Etherscan link (the complete URL), L1 and L2 token addresses, sender and receiver addresses (with leading zeroes stripped), and the deposited amount (converted from hex to decimal). Ensure data is properly formatted and parsed according to Optimism's address and token standards, and remove the prefix '0x' except transaction hash. Note that, the keccak-256 hash of the Ethereum event signature for DepositFinalized is \"0x3303facd24627943a92e9dc87cfbb34b15c49b726eec3ad3487c16be9ab8efe8\".", "external_knowledge": "optimism_standard_bridge_contract.md"}
{"instance_id": "sf_bq416", "db": "GOOG_BLOCKCHAIN", "question": "Could you retrieve the top three largest USDT transfers on the TRON blockchain by listing the block numbers, source addresses, destination addresses (in TronLink format), and transfer amounts, using the USDT contract address '0xa614f803b6fd780986a42c78ec9c7f77e6ded13c' and the transfer event signature '0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef', dividing the raw transfer value by 1,000,000 to convert it into the final USDT amount, and then ordering the results by the largest transferred amounts first?", "external_knowledge": "blockchain_data_transformations.md"}
{"instance_id": "sf_bq226", "db": "GOOG_BLOCKCHAIN", "question": "Which sender address, represented as a complete URL on https://cronoscan.com, has been used most frequently on the Cronos blockchain in transactions to non-null 'to_address' fields, within blocks larger than 4096 bytes, since January 1, 2023?", "external_knowledge": null}
{"instance_id": "sf_bq016", "db": "DEPS_DEV_V1", "question": "Considering only the highest release versions of NPM packages, which dependency (package and its version) appears most frequently among the dependencies of these packages?", "external_knowledge": null}
{"instance_id": "sf_bq062", "db": "DEPS_DEV_V1", "question": "What is the most frequently used license by packages in each system?", "external_knowledge": null}
{"instance_id": "sf_bq063", "db": "DEPS_DEV_V1", "question": "Find the GitHub URL (with link label 'SOURCE_REPO') of the latest released version of the NPM package that has the highest number of dependencies in its latest released version, excluding packages whose names contain the character '@' and only considering URLs where the link label is 'SOURCE_REPO' and the URL contains 'github.com'.", "external_knowledge": null}
{"instance_id": "sf_bq028", "db": "DEPS_DEV_V1", "question": "Considering only the latest release versions of NPM packages, which packages are the top 8 most popular based on GitHub star count, and what are their versions?", "external_knowledge": null}
{"instance_id": "bq022", "db": "chicago", "question": "Calculate the minimum and maximum trip duration in minutes (rounded to the nearest whole number), total number of trips, and average fare for each of six equal quantile groups based on trip duration, considering only trips between 0 and 60 minutes.", "external_knowledge": null}
{"instance_id": "bq362", "db": "chicago", "question": "Which three companies had the largest increase in trip numbers between two consecutive months in 2018?", "external_knowledge": null}
{"instance_id": "bq363", "db": "chicago", "question": "Calculate the total number of trips and average fare (formatted to two decimal places) for ten equal quantile groups of taxi trips based on rounded minute durations between 1-50 minutes, displaying each group's formatted time range (XXm to XXm) sorted chronologically, where quantile groups are created from ordered trip durations and time ranges represent the minimum/maximum values within each quantile partition", "external_knowledge": null}
{"instance_id": "bq076", "db": "chicago", "question": "What is the highest number of motor vehicle theft incidents that occurred in any single month during 2016?", "external_knowledge": null}
{"instance_id": "bq077", "db": "chicago", "question": "For each year from 2010 to 2016, what is the highest number of motor thefts in one month?", "external_knowledge": null}
{"instance_id": "bq350", "db": "open_targets_platform_1", "question": "For the detailed molecule data, please display the drug id, drug type, and withdrawal status for approved drugs with a black box warning and known drug type among 'Keytruda', 'Vioxx', 'Premarin', and 'Humira'.", "external_knowledge": null}
{"instance_id": "bq379", "db": "open_targets_platform_1", "question": "Which target approved symbol has the overall association score closest to the mean score for psoriasis?", "external_knowledge": null}
{"instance_id": "bq078", "db": "open_targets_platform_2", "question": "Retrieve the approved symbol of target genes with the highest overall score that are associated with the disease 'EFO_0000676' from the data source 'IMPC'.", "external_knowledge": null}
{"instance_id": "bq095", "db": "open_targets_platform_1", "question": "Generate a list of drugs from the table containing molecular details that have completed clinical trials for pancreatic endocrine carcinoma, disease ID EFO_0007416. Please include each drug's name, the target approved symbol, and links to the relevant clinical trials.", "external_knowledge": null}
{"instance_id": "bq109", "db": "open_targets_genetics_1", "question": "Find the average, variance, max-min difference, and the QTL source (right study) of the maximum log2(h4/h3) for data where right gene id is \"ENSG00000169174\", h4 > 0.8, h3 < 0.02, reported trait includes \"lesterol levels\", right biological feature is \"IPSC\", and the variant is '1_55029009_C_T'.", "external_knowledge": null}
{"instance_id": "sf_bq325", "db": "OPEN_TARGETS_GENETICS_2", "question": "Please identify the top 10 genes with the strongest associations across all studies by first selecting, for each gene within each study, the variant with the lowest p-value, and then ranking all such gene–variant pairs to return the 10 genes with the smallest p-values overall.", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "bq090", "db": "CYMBAL_INVESTMENTS", "question": "Considering only long-side trades, how much higher is the average intrinsic value for trades using the feeling-lucky strategy than for those using the momentum strategy?", "external_knowledge": null}
{"instance_id": "bq442", "db": "CYMBAL_INVESTMENTS", "question": "Please collect the information of the top 6 trade reports with the highest closing prices. Refer to the document for all the information I want.", "external_knowledge": "Trade_Capture_Report_Data_List.md"}
{"instance_id": "bq079", "db": "usfs_fia", "question": "Considering only the latest evaluation group per state for the 'EXPCURR' evaluation type, determine which state has the highest total acreage of timberland and which has the highest total acreage of forestland. For timberland, include plots where the condition status code is 1, the reserved status code is 0, and the site productivity class code is between 1 and 6. For forestland, include plots where the condition status code is 1. Calculate the total acres by summing the adjusted expansion factors for macroplots and subplots, using their respective proportion bases ('MACR' for macroplots and 'SUBP' for subplots) and adjustment factors when greater than zero. For each category (timberland and forestland), provide the state code, evaluation group, state name, and the total acres for the state with the highest total acreage, considering only the latest evaluation group per state.", "external_knowledge": null}
{"instance_id": "bq024", "db": "usfs_fia", "question": "For the year 2012, which top 10 evaluation groups have the largest subplot acres when considering only the condition with the largest subplot acres within each group? Please include the evaluation group, evaluation type, condition status code, evaluation description, state code, macroplot acres, and subplot acres.", "external_knowledge": null}
{"instance_id": "bq220", "db": "usfs_fia", "question": "Based on the condition, plot_tree, and population tables in bigquery-public-data.usfs_fia, for the evaluation_type set to 'EXPCURR' and condition_status_code equal to 1, which states had the largest average subplot size and the largest average macroplot size, respectively, for each of the years 2015, 2016, and 2017? Please include the type of plot (subplot or macroplot), the specific year, the state, and the corresponding average size in your results.", "external_knowledge": "subplot_macroplot_size.md"}
{"instance_id": "bq096", "db": "gbif", "question": "Determine which year had the earliest date after January on which more than 10 sightings of Sterna paradisaea were recorded north of 40 degrees latitude. For each year, find the first day after January with over 10 sightings of this species in that region, and identify the year whose earliest such date is the earliest among all years.", "external_knowledge": null}
{"instance_id": "sf_bq276", "db": "NOAA_PORTS", "question": "Can you provide a comprehensive list of all ports in region number 6585 that lie within U.S. state boundaries and have been affected by named storms in the North Atlantic basin with wind speeds of at least 35 knots and a Saffir-Simpson classification of at least minimal tropical storm strength, including for each port its name, the state name, the distinct years in which storms occurred, the total count of distinct storms, the distinct storm names, the average storm category, the average wind speed, and the respective geometries for both the port and the tropical storm areas?", "temporal": "Yes", "external_knowledge": "persistent_udfs_routines.md"}
{"instance_id": "bq277", "db": "noaa_ports", "question": "Which single port, listed under region number '6585', is located within a U.S. state boundary and appears most frequently inside the geographic areas of named tropical storms with wind speeds of at least 35 knots in the North Atlantic basin, excluding those labeled 'NOT_NAMED'?", "external_knowledge": "persistent_udfs_routines.md"}
{"instance_id": "bq278", "db": "sunroof_solar", "question": "Please provide a detailed comparison of the solar potential for each state, distinguishing between postal code and census tract levels. For each state, include the total number of buildings available for solar installations, the average percentage of Google Maps area covered by Project Sunroof, the average percentage of that coverage which is suitable for solar, the total potential panel count, the total kilowatt capacity, the energy generation potential, the carbon dioxide offset, the current number of buildings with solar panels, and the gap in potential installations calculated by adjusting the total qualified buildings with the coverage and suitability percentages and subtracting the current installations.", "external_knowledge": null}
{"instance_id": "bq102", "db": "gnomAD", "question": "Identify which start positions are associated with missense variants in the BRCA1 gene on chromosome 17, where the reference base is 'C' and the alternate base is 'T'. Using data from the gnomAD v2.1.1 version.", "external_knowledge": null}
{"instance_id": "bq445", "db": "gnomAD", "question": "Using the gnomAD v2.1.1 genomes data for chromosome 17, determine the smallest start position and largest end position of any variant whose nested VEP annotations contain the symbol 'BRCA1'. Then, for all variants whose positions fall within that gene region, retrieve the 'Protein_position' values only if the 'Consequence' includes 'missense_variant', sort them in ascending order by 'Protein_position', and finally output the first such result.", "external_knowledge": null}
{"instance_id": "bq103", "db": "gnomAD", "question": "Generate summary statistics on genetic variants in the region between positions 55039447 and 55064852 on chromosome 1. This includes the number of variants, the total allele count, the total number of alleles, and distinct gene symbols (using Variant Effect Predictor, VEP, for gene annotation). Additionally, compute the density of mutations by dividing the length of the region by the number of variants. Using data from the gnomAD v3 version.", "external_knowledge": null}
{"instance_id": "sf_bq104", "db": "GOOGLE_TRENDS", "question": "Based on the most recent refresh date, identify the top-ranked rising search term for the week that is exactly one year prior to the latest available week in the dataset.", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "sf_bq411", "db": "GOOGLE_TRENDS", "question": "Please retrieve the top three Google Trends search terms (ranks 1, 2, and 3) from top_terms for each weekday (Monday through Friday) between September 1, 2024, and September 14, 2024, grouped by the refresh_date column and ordered in descending order of refresh_date.", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "bq105", "db": "nhtsa_traffic_fatalities_plus", "question": "According to the 2015 and 2016 accident and driver distraction data, and excluding cases where the driver’s distraction status is recorded as 'Not Distracted,' 'Unknown if Distracted,' or 'Not Reported,' how many traffic accidents per 100,000 people were caused by driver distraction in each U.S. state for those two years, based on 2010 census population data, and which five states each year had the highest rates?", "external_knowledge": null}
{"instance_id": "bq108", "db": "nhtsa_traffic_fatalities", "question": "Within the 2015 dataset for accidents that occurred from January through August and involved more than one distinct person, what percentage of these accidents had more than one individual with a severe injury (injury severity = 4)?", "external_knowledge": null}
{"instance_id": "bq067", "db": "nhtsa_traffic_fatalities", "question": "I want to create a labeled dataset from the National Highway Traffic Safety Administration traffic fatality data that predicts whether a traffic accident involving more than one distinct person results in more than one fatality, where the label is 1 if an accident has more than one person with an injury severity code of 4 (fatal injury) and 0 otherwise. For each accident, include the numeric predictors: state_number, the vehicle body_type, the number_of_drunk_drivers, the day_of_week, the hour_of_crash, and a binary indicator for whether the accident occurred in a work zone (1 if it is not “None,” otherwise 0). Also, engineer a feature for the average absolute difference between travel_speed and speed_limit per accident, only considering travel speeds up to 151 mph (excluding codes 997, 998, 999) and speed limits up to 80 mph (excluding codes 98, 99), and categorize this average speed difference into levels from 0 to 4 in 20 mph increments with lower bounds inclusive and upper bounds exclusive. Finally, only include accidents that involve more than one distinct person.", "external_knowledge": "nhtsa_traffic_fatalities.md"}
{"instance_id": "bq396", "db": "nhtsa_traffic_fatalities", "question": "Which top 3 states had the largest differences in the number of traffic accidents between rainy and clear weather during weekends in 2016? Please also provide the respective differences for each state.", "external_knowledge": null}
{"instance_id": "bq441", "db": "nhtsa_traffic_fatalities", "question": "Please help me compile the critical details on traffic accidents in 2015, as listed in the info document.", "external_knowledge": "Traffic_Fatalities_Info_List_2015.md"}
{"instance_id": "bq097", "db": "sdoh", "question": "What is the increase in average earnings per job between 2012 and 2017 for each geographic region in Massachusetts (indicated by \"MA\" at the end of GeoName)?", "external_knowledge": null}
{"instance_id": "bq120", "db": "sdoh", "question": "Identify the top 10 regions (counties) with the highest total number of SNAP-participating households, using the 2017 5-year ACS county-level data and SNAP enrollment data from January 1, 2017, excluding regions where the total SNAP participation is zero. For each of these regions, calculate the ratio of households earning under $20,000 to the total number of SNAP-participating households.", "external_knowledge": null}
{"instance_id": "bq110", "db": "sdoh", "question": "What is the change in the number of homeless veterans between 2012 and 2018 for each CoC region in New York that has data available in both years?", "external_knowledge": null}
{"instance_id": "bq395", "db": "sdoh", "question": "Calculate the percentage change in the total number of unsheltered homeless people from 2015 to 2018 for each state by summing the counts over all Continuums of Care (CoCs) within each state. Then, determine the national average of these state percentage changes. Identify the five states whose percentage change is closest to this national average percentage change. Please provide the state abbreviations.", "external_knowledge": null}
{"instance_id": "bq352", "db": "sdoh", "question": "Please list the average number of prenatal weeks in 2018 for counties in Wisconsin where more than 5% of the employed population had commutes of 45-59 minutes in 2017.", "external_knowledge": null}
{"instance_id": "bq074", "db": "sdoh", "question": "Count the number of counties that experienced an increase in unemployment from 2015 to 2018, using 5-year ACS data, and a decrease in dual-eligible enrollee counts between December 1, 2015, and December 1, 2018.", "external_knowledge": null}
{"instance_id": "bq066", "db": "sdoh", "question": "Could you assess the relationship between the poverty rates from the previous year's census data and the percentage of births without maternal morbidity for the years 2016 to 2018? Use only data for births where no maternal morbidity was reported and, for each year, use the 5-year census data from the year before to compute the Pearson correlation coefficient.", "external_knowledge": null}
{"instance_id": "bq114", "db": "openaq", "question": "Which three cities have the largest difference between their 1990 EPA PM2.5 measurements (using units_of_measure = 'Micrograms/cubic meter (LC)' and parameter_name = 'Acceptable PM2.5 AQI & Speciation Mass') and their 2020 OpenAQ PM2.5 measurements (where pollutant = 'pm25' based on the year extracted from the timestamp), with both datasets matched by latitude and longitude rounded to two decimals, and the difference ordered from greatest to least?", "external_knowledge": null}
{"instance_id": "bq116", "db": "sec_quarterly_financials", "question": "Which U.S. state reported the highest total annual revenue in billions of dollars during fiscal year 2016, considering companies that provided four quarters of data and reported measure tags in ('Revenues','SalesRevenueNet','SalesRevenueGoodsNet'), excluding any entries where the state field (stprba) is null or empty?", "external_knowledge": null}
{"instance_id": "bq015", "db": "stackoverflow_plus", "question": "Identify and rank the top 10 tags from Stack Overflow questions that were referenced in Hacker News comments on or after 2014 by counting how many times each question was mentioned, then splitting the questions’ tag strings by the '|' delimiter and grouping by tag.", "external_knowledge": null}
{"instance_id": "bq041", "db": "stackoverflow", "question": "Compute the monthly statistics for new StackOverflow users created in 2021. For each month, report the total number of new users, the percentage of these new users who asked at least one question within 30 days of signing up, and among those who asked a question within 30 days, the percentage who then answered at least one question after their first question and within 30 days following their first question.", "external_knowledge": null}
{"instance_id": "sf_bq121", "db": "STACKOVERFLOW", "question": "How do the average reputation and number of badges vary among Stack Overflow users based on the number of complete years they have been members, considering only those who joined on or before October 1, 2021?", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "bq123", "db": "stackoverflow", "question": "You need to determine which day of the week has the third highest percentage of questions on Stack Overflow that receive an answer within an hour. To do this, use the question creation date from the posts_questions table and the earliest answer creation date from the posts_answers table. Once you’ve calculated the percentage of questions that get answered within an hour for each day, identify the day with the third highest percentage and report that percentage.", "external_knowledge": null}
{"instance_id": "bq280", "db": "stackoverflow", "question": "Please provide the display name of the user who has answered the most questions on Stack Overflow, considering only users with a reputation greater than 10.", "external_knowledge": null}
{"instance_id": "bq300", "db": "stackoverflow", "question": "What is the highest number of answers received for a single Python 2 specific question on Stack Overflow, excluding any discussions that involve Python 3?", "external_knowledge": null}
{"instance_id": "bq301", "db": "stackoverflow", "question": "Retrieve details of accepted answers to Stack Overflow questions posted in January 2016 that have tags including \"javascript\" and at least one of \"xss\", \"cross-site\", \"exploit\", or \"cybersecurity\"; the answers themselves must also have been posted in January 2016. For each accepted answer, include the answer's ID, the answerer's reputation, score, and comment count, along with the associated question's tags, score, answer count, the asker's reputation, view count, and comment count.", "external_knowledge": null}
{"instance_id": "bq302", "db": "stackoverflow", "question": "What is the monthly proportion of Stack Overflow questions tagged with 'python' in the year 2022?", "external_knowledge": null}
{"instance_id": "bq303", "db": "stackoverflow", "question": "From July 1, 2019 through December 31, 2019, for all users with IDs between 16712208 and 18712208 on Stack Overflow, retrieve the user ID and the tags of the relevant question for each of their contributions, including comments on both questions and answers, any answers they posted, and any questions they authored, making sure to correctly associate the comment or answer with its parent question’s tags.", "external_knowledge": null}
{"instance_id": "bq304", "db": "stackoverflow", "question": "Retrieve the top 50 most viewed questions for each of the following Android-related tags on StackOverflow: 'android-layout', 'android-activity', 'android-intent', 'android-edittext', 'android-fragments', 'android-recyclerview', 'listview', 'android-actionbar', 'google-maps', and 'android-asynctask'. Each question must contain the word 'how' in either its title or body and must not contain any of the following troubleshooting terms in either its title or body: 'fail', 'problem', 'error', 'wrong', 'fix', 'bug', 'issue', 'solve', or 'trouble'. Only include tags that have at least 50 questions meeting these criteria, and for each such tag, select the top 50 questions ranked by view count.", "external_knowledge": null}
{"instance_id": "bq310", "db": "stackoverflow", "question": "What is the title of the most viewed \"how\" question related to Android development on StackOverflow, across specified tags such as 'android-layout', 'android-activity', 'android-intent', and others?", "external_knowledge": null}
{"instance_id": "bq305", "db": "stackoverflow", "question": "Which 10 users have the highest combined view counts for questions they are associated with, where a user is considered associated if they own the question, or their answer is the accepted answer, or their answer's score is greater than 5, or their answer's score exceeds 20% of the total answer scores for that question (and is above 0), or their answer is among the top three highest-scoring answers for that question?", "external_knowledge": null}
{"instance_id": "bq306", "db": "stackoverflow", "question": "Identify the top 10 tags for user 1908967, based only on answers posted before June 7, 2018, where each tag’s score is 10 times the number of upvotes (vote_type_id=2) and 15 times the number of accepted answers (vote_type_id=1). Derive tags from the questions associated with those answers, and consider only the upvotes and accepted answers for those answers. Return the tags with the highest total scores in descending order, limited to 10 tags.", "external_knowledge": null}
{"instance_id": "sf_bq307", "db": "STACKOVERFLOW", "question": "Find the top 10 gold badges that users most commonly earn as their first gold badge on Stack Overflow. For each of these badges, display the badge name, the number of users who earned it as their first gold badge, and the average number of days from the user's account creation date to the date they earned the badge, calculated in days without any adjustments for date formats.", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "bq308", "db": "stackoverflow", "question": "Show the number of Stack Overflow questions asked each day of the week in 2021, and find out how many and what percentage of those were answered within one hour.", "external_knowledge": null}
{"instance_id": "bq309", "db": "stackoverflow", "question": "Retrieve the top 10 longest questions on Stack Overflow, measured by the length of their body text, where each question either has an accepted answer or has no accepted answer but has at least one answer with a score-to-view ratio exceeding 0.01. For each of these questions, include the reputation of the user who asked the question, the user's net votes (calculated as their total up_votes minus down_votes), and the total number of badges the user has earned.", "external_knowledge": null}
{"instance_id": "bq124", "db": "fhir_synthea", "question": "Among all patients, how many individuals remain alive (i.e., with no recorded deceased.dateTime), have a diagnosis of either Diabetes or Hypertension, and are prescribed at least seven distinct active medications?", "external_knowledge": null}
{"instance_id": "bq391", "db": "fhir_synthea", "question": "Among living patients whose last names begin with \"A\" and who each have exactly one distinct condition, which eight conditions have the highest number of different active medications prescribed to any single patient, and what are their corresponding codes?", "external_knowledge": null}
{"instance_id": "bq126", "db": "the_met", "question": "What are the titles, artist names, mediums, and original image URLs of objects with 'Photograph' in their names from the 'Photographs' department, created not by an unknown artist, with an object end date of 1839 or earlier?", "external_knowledge": null}
{"instance_id": "bq366", "db": "the_met", "question": "What are the top three most frequently associated labels with artworks from each historical period in The Met's collection, only considering labels linked to 500 or more artworks? Provide me with the period, label, and the associated count.", "external_knowledge": null}
{"instance_id": "bq414", "db": "the_met", "question": "Retrieve the object id, title, and the formatted metadata date (as a string in 'YYYY-MM-DD' format) for objects in the \"The Libraries\" department where the cropConfidence is greater than 0.5 and the object's title contains the word \"book\".", "external_knowledge": null}
{"instance_id": "bq200", "db": "mlb", "question": "Using data from both the regular season and the post-season, identify the pitcher who achieved the highest non-zero pitch speed for each team by confirming whether the pitcher’s ID appears in the relevant home or away player lists for that game, then retrieve that pitcher’s full name along with the maximum valid pitch speed they achieved while playing for that specific team.", "external_knowledge": null}
{"instance_id": "sf_bq458", "db": "WORD_VECTORS_US", "question": "Tokenize the body text of each article into words, excluding stop words, and obtain the corresponding word vectors for these words from the glove vector. For each word, weight its word vector by dividing each component by the 0.4th power of the word's frequency from the word frequencies. Then, for each article, aggregate these weighted word vectors by summing their components to form an article vector. Normalize each article vector to unit length by dividing by its magnitude. Finally, retrieve the ID, date, title, and the normalized article vector for each article.", "external_knowledge": null}
{"instance_id": "sf_bq459", "db": "WORD_VECTORS_US", "question": "Please find the top 10 most relevant articles by only processing each article’s 'body' field, where each body is tokenized with no stopwords, each remaining token is turned into a GloVe-based word vector and weighted by dividing each dimension by the 0.4th power of its word frequency, then these weighted vectors are summed and normalized to get a unit vector for each article. Perform the same weighting and normalization on the query phrase 'Epigenetics and cerebral organoids: promising directions in autism spectrum disorders' and compute the cosine similarity between the query vector and each article vector. Finally, return the id, date, title, and the cosine similarity score for the top 10 articles with the highest similarity.", "external_knowledge": "tokenize_func.md"}
{"instance_id": "sf_bq460", "db": "WORD_VECTORS_US", "question": "Please process the articles from the 'nature' dataset by first tokenizing the body text into words and removing stopwords. For each remaining word, retrieve its word vector from the glove_vectors table and its frequency from the word_frequencies table, then divide each word vector by the 0.4th power of the word's frequency to weight it. Sum the weighted vectors to obtain an aggregate vector for each article, normalize this aggregate vector to unit length, and then compute the cosine similarity scores between these normalized vectors. Finally, return the IDs, dates, titles, and cosine similarity scores of the top 10 articles most similar to the article with the ID '8a78ef2d-d5f7-4d2d-9b47-5adb25cbd373'.", "external_knowledge": null}
{"instance_id": "bq204", "db": "eclipse_megamovie", "question": "Find the user with the highest total clicks across all records from all available photo collections.", "external_knowledge": null}
{"instance_id": "bq389", "db": "epa_historical_air_quality", "question": "Please calculate the monthly average levels of PM10, PM2.5 FRM, PM2.5 non-FRM, volatile organic emissions, SO2 (scaled by a factor of 10), and Lead (scaled by a factor of 100) air pollutants in California for the year 2020.", "external_knowledge": null}
{"instance_id": "sf_bq345", "db": "IDC", "question": "How large are the DICOM image files with SEG or RTSTRUCT modalities and the SOP Class UID \"1.2.840.10008.5.1.4.1.1.66.4\", when grouped by collection, study, and series IDs, if they have no references to other series, images, or sources? Can you also provide a viewer URL formatted as \"https://viewer.imaging.datacommons.cancer.gov/viewer/\" followed by the study ID, and list these sizes in kilobytes, sorted from largest to smallest?", "external_knowledge": null}
{"instance_id": "sf_bq346", "db": "IDC", "question": "In publicly accessible DICOM data where the Modality is 'SEG' and the SOPClassUID is '1.2.840.10008.5.1.4.1.1.66.4', and each segmentation references its original SOPInstanceUID, which five segmentation categories (by 'SegmentedPropertyCategory.CodeMeaning') occur most frequently?", "external_knowledge": null}
{"instance_id": "sf_bq347", "db": "IDC", "question": "From the union of the specified MR series with SeriesInstanceUID 1.3.6.1.4.1.14519.5.2.1.3671.4754.105976129314091491952445656147 and all associated segmentation instances, which modality has the greatest number of SOP instances in total, and how many are there?", "external_knowledge": null}
{"instance_id": "sf_bq390", "db": "IDC", "question": "In the \"qin_prostate_repeatability\" collection, please provide the distinct StudyInstanceUIDs for studies that include T2-weighted axial MR imaging and also contain anatomical structure segmentations labeled as \"Peripheral zone.\"", "external_knowledge": null}
{"instance_id": "sf_bq421", "db": "IDC", "question": "Can you list all unique pairs of embedding medium and staining substance code meanings, along with the number of occurrences for each pair, based on distinct embedding medium and staining substance codes from the 'SM' modality in the DICOM dataset's un-nested specimen preparation sequences, ensuring that the codes are from the SCT coding scheme?", "external_knowledge": null}
{"instance_id": "sf_bq422", "db": "IDC", "question": "Considering only CT images from the 'nlst' collection, what are the average series sizes in MiB for the top 3 patients with the highest slice interval difference tolerance (calculated as the difference between the maximum and minimum unique slice intervals within their series) and the top 3 patients with the highest maximum exposure difference (calculated as the difference between the maximum and minimum unique exposure values within their series), where the series size is determined by summing the instance sizes of all images in a series and converting it to MiB?", "external_knowledge": null}
{"instance_id": "sf_bq069", "db": "IDC", "question": "Could you help me generate a report of CT image series from the dicom_all table such that all series from the NLST collection are excluded, any localizers or JPEG-compressed series (transfer syntaxes 1.2.840.10008.1.2.4.70 or 1.2.840.10008.1.2.4.51) are skipped, and only those passing certain geometry checks—namely a single orientation, identical pixel spacing, matching SOP instance and position counts, uniform pixel rows and columns, and a near-unity dot product of image orientation vectors—are included, while also computing slice interval differences, exposure differences, and approximate series size in MB for each qualified series?", "external_knowledge": "nonNlstCohort.md"}
{"instance_id": "sf_bq219", "db": "IOWA_LIQUOR_SALES", "question": "In the Iowa Liquor Sales dataset, starting from January 1, 2022 through the last fully completed month, which two liquor categories, each contributing an average of at least 1% to the monthly sales volume over at least 24 months of available data, have the lowest Pearson correlation coefficient when comparing their monthly percentages of total liquor sales across those months, and what are their names?", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "bq199", "db": "iowa_liquor_sales", "question": "Identify the top 10 liquor categories in Iowa in 2021 by calculating, for each category, the average of the per-liter retail prices across all sales transactions in that category during 2021. For these top categories, provide their average per-liter retail prices calculated in the same manner for the years 2019, 2020, and 2021.", "external_knowledge": null}
{"instance_id": "bq218", "db": "iowa_liquor_sales", "question": "What are the top 5 items with the highest year-over-year growth percentage in total sales revenue for the year 2023?", "external_knowledge": null}
{"instance_id": "bq049", "db": "iowa_liquor_sales_plus", "question": "Please show the monthly per capita Bourbon Whiskey sales during 2022 in Dubuque County for the zip code that ranks third in total Bourbon Whiskey sales, using only the population aged 21 and older.", "external_knowledge": null}
{"instance_id": "bq360", "db": "nppes", "question": "Among healthcare providers whose practice location is in Mountain View, CA, and who have a specified specialization in the field healthcare provider taxonomy, identify the top 10 most common specializations based on the count of distinct NPIs. Then determine which of those top 10 has a count of distinct NPIs closest to the average count across those 10 specializations.", "external_knowledge": null}
{"instance_id": "bq286", "db": "usa_names", "question": "Can you tell me the name of the most popular female baby in Wyoming for the year 2021, based on the proportion of female babies given that name compared to the total number of female babies given the same name across all states?", "external_knowledge": null}
{"instance_id": "sf_bq044", "db": "TCGA", "question": "For bladder cancer patients who have mutations in the CDKN2A (cyclin-dependent kinase inhibitor 2A) gene, using clinical data from the Genomic Data Commons Release 39, what types of mutations are they, what is their gender, vital status, and days to death - and for four downstream genes (MDM2 (MDM2 proto-oncogene), TP53 (tumor protein p53), CDKN1A (cyclin-dependent kinase inhibitor 1A), and CCNE1 (Cyclin E1)), what are the gene expression levels for each patient?", "external_knowledge": "TCGA_Study_Abbreviations.md"}
{"instance_id": "sf_bq043", "db": "TCGA", "question": "What are the RNA expression levels of the genes MDM2, TP53, CDKN1A, and CCNE1, along with associated clinical information, in bladder cancer patients with CDKN2A mutations in the 'TCGA-BLCA' project? Use clinical data from the Genomic Data Commons Release 39, data about somatic mutations derived from the hg19 human genome reference in Feb 2017.", "external_knowledge": null}
{"instance_id": "bq143", "db": "CPTAC_PDC", "question": "Use CPTAC proteomics and RNAseq data for Clear Cell Renal Cell Carcinoma to select 'Primary Tumor' and 'Solid Tissue Normal' samples. Join the datasets on sample submitter IDs and gene symbols. Calculate the correlation between protein abundance (log2 ratio) and gene expression levels (log-transformed+1 FPKM) for each gene and sample type. Filter out correlations with an absolute value greater than 0.5, and compute the average correlation for each sample type.", "external_knowledge": null}
{"instance_id": "sf_bq147", "db": "TCGA", "question": "Can you identify the TCGA breast cancer cases in the RNA sequencing hg38 r35 data, restricted to protein_coding genes and the TCGA-BRCA project, that have RNA sequencing samples of multiple tissue types, including \"Solid Tissue Normal\", within the same case?", "external_knowledge": null}
{"instance_id": "sf_bq148", "db": "TCGA", "question": "Could you list the top five protein-coding genes with the highest variance in expression levels, measured as `fpkm_uq_unstranded`, specifically in 'Solid Tissue Normal' samples from TCGA-BRCA cases that include 'Solid Tissue Normal' among their sample types?", "external_knowledge": null}
{"instance_id": "sf_bq175", "db": "TCGA_MITELMAN", "question": "Identify cytoband names on chromosome 1 in the TCGA-KIRC segment allelic dataset where the frequency of amplifications, gains, and heterozygous deletions each rank within the top 11. Calculate these rankings based on the maximum copy number observed across various genomic studies of kidney cancer, reflecting the severity of genetic alterations.", "external_knowledge": "Comprehensive_Guide_to_Copy_Number_Variations_in_Cancer_Genomics.md"}
{"instance_id": "sf_bq176", "db": "TCGA_MITELMAN", "question": "Identify the case barcodes from the TCGA-LAML study with the highest weighted average copy number in cytoband 15q11 on chromosome 15, using segment data and cytoband overlaps from TCGA's genomic and Mitelman databases.", "external_knowledge": null}
{"instance_id": "sf_bq170", "db": "TCGA_MITELMAN", "question": "For breast cancer cases (TCGA-BRCA) from Release 23 of the active GDC archive, identify and categorize copy number variations (CNVs) across all cytobands on every chromosome. For each cytoband and each case, determine the overlap between the cytoband region and the case's copy number segments, and compute the overlap-weighted average copy number for that cytoband in the case, rounding to the nearest whole number. Classify the rounded copy number into CNV types as follows: homozygous deletions (0), heterozygous deletions (1), normal diploid state (2), gains (3), and amplifications (greater than 3). For each cytoband, provide its name and start/end positions, and calculate the frequency of each CNV type across all cases as a percentage of the total number of cases, rounded to two decimal places.", "external_knowledge": "copy_number_variations.md"}
{"instance_id": "sf_bq150", "db": "TCGA_HG19_DATA_V0", "question": "Assess whether different genetic variants affect the log10-transformed TP53 expression levels in TCGA-BRCA samples using sequencing and mutation data. Provide the total number of samples, the number of mutation types, the mean square between groups, the mean square within groups, and the F-statistic.", "external_knowledge": "TCGA_F_Score.md"}
{"instance_id": "sf_bq152", "db": "TCGA_HG38_DATA_V0", "question": "For breast cancer cases (TCGA-BRCA) from Release 23 of the active GDC archive, identify and categorize copy number variations (CNVs) across all cytobands on every chromosome. For each cytoband and each case, determine the overlap between the cytoband region and the case's copy number segments, and compute the overlap-weighted average copy number for that cytoband in the case, rounding to the nearest whole number. Classify the rounded copy number into CNV types as follows: homozygous deletions (0), heterozygous deletions (1), normal diploid state (2), gains (3), and amplifications (greater than 3). For each cytoband, provide its name and start/end positions, and calculate the frequency of each CNV type across all cases as a percentage of the total number of cases, rounded to two decimal places.", "external_knowledge": null}
{"instance_id": "sf_bq155", "db": "TCGA_HG38_DATA_V0", "question": "In the TCGA-BRCA cohort of patients who are 80 years old or younger at diagnosis and have a pathological stage of Stage I, Stage II, or Stage IIA, calculate the t-statistic derived from the Pearson correlation between the log10-transformed average RNA-Seq expression levels (using HTSeq__Counts + 1) of the gene SNORA31 and the average microRNA-Seq expression levels of all unique microRNAs, only considering pairs with more than 25 samples and where the absolute Pearson correlation coefficient is between 0.3 and 1.0.", "external_knowledge": null}
{"instance_id": "sf_bq141", "db": "TCGA_HG38_DATA_V0", "question": "Using the TCGA-KIRP dataset, select patients from the 'TCGA_bioclin_v0.Clinical' table who have a non-null clinical_stage and a disease_code of 'KIRP'. Retrieve their gene expression data from the 'TCGA_hg38_data_v0.RNAseq_Gene_Expression' table for the genes 'MT-CO3', 'MT-CO1', and 'MT-CO2', and randomly split the patients into a training set (90%) and a test set (10%) based on their case_barcode via the FARM_FINGERPRINT method. For each clinical stage in the training set, calculate the average HTSeq__FPKM_UQ expression of the three genes. For each patient in the test set, compute the Euclidean distance between the patient’s expression values and the stage-specific averages, and assign that patient to the clinical stage whose average is closest. Finally, output the case_barcode and the predicted clinical stage.", "external_knowledge": null}
{"instance_id": "bq046", "db": "TCGA_bioclin_v0", "question": "Find case barcodes and their corresponding GDC file URLs for female patients aged 30 or younger diagnosed with breast cancer, whose clinical history includes problematic prior treatments for other cancers or redacted annotations. Only consider relevant clinical and annotation data from TCGA with GDC archive release 14.", "external_knowledge": null}
{"instance_id": "sf_bq153", "db": "PANCANCER_ATLAS_1", "question": "Calculate, for each histology type specified in the 'icd_o_3_histology' field (excluding those enclosed in square brackets), the average of the per-patient average log10(normalized_count + 1) expression levels of the IGF2 gene among LGG patients with valid IGF2 expression data. Match gene expression and clinical data using the ParticipantBarcode field.", "external_knowledge": null}
{"instance_id": "sf_bq154", "db": "PANCANCER_ATLAS_1", "question": "Calculate the Kruskal-Wallis H-score among groups of LGG patients for IGF2 gene expression, where each patient’s IGF2 expression is determined by applying log10(normalized_count + 1) and then averaging across samples. Group the patients by ICD-O-3 histology codes, exclude any codes fully enclosed in square brackets, only include groups with more than one patient, and ensure that normalized count is not null. Finally, return the total number of groups, the total number of samples, and the Kruskal-Wallis H-score in descending order.", "external_knowledge": "Regulome_Explorer_Kruskal-Wallis_test_for_numerical_and_categorical_data.md"}
{"instance_id": "sf_bq156", "db": "PANCANCER_ATLAS_1", "question": "Compute the t-score (rounded to 2 decimals) to compare the difference in mean expression levels of the gene DRG2 between two groups (TP53 mutated vs. non-mutated) in the Lower Grade Glioma (LGG) study, where the expression levels are calculated as the average of log10(normalized_count + 1) for each participant, only considering samples with TP53 mutations that have a 'FILTER' status of 'PASS' in the mutation data, and ignoring any groups with fewer than 10 samples or with zero variance; refer to `t_score.md` for the method of computing the t-score.", "external_knowledge": "t_score.md"}
{"instance_id": "sf_bq157", "db": "PANCANCER_ATLAS_1", "question": "Please compute the T-score to determine the statistical difference in the expression of the DRG2 gene between LGG patients with and without TP53 mutation: for each patient, calculate the average of log10(normalized_count + 1) of DRG2 expression across all their samples, using only samples present in the `MC3_MAF_V5_one_per_tumor_sample` table for the LGG study; identify patients with TP53 mutations from this table where `Hugo_Symbol` is 'TP53' and `FILTER` is 'PASS'; then perform a T-test comparing the mean averaged log-transformed DRG2 expression between patients with and without TP53 mutation.", "external_knowledge": "Regulome_Explorer_T_test_for_numerical_and_binary_data.md"}
{"instance_id": "sf_bq158", "db": "PANCANCER_ATLAS_1", "question": "Which top five histological types of breast cancer (BRCA) in the PanCancer Atlas exhibit the highest percentage of CDH1 gene mutations?", "external_knowledge": null}
{"instance_id": "sf_bq159", "db": "PANCANCER_ATLAS_1", "question": "Calculate the chi-square value to assess the association between histological types and the presence of CDH1 gene mutations in BRCA patients using data from the PanCancer Atlas. Focus on patients with known histological types and consider only reliable mutation entries. Exclude any histological types or mutation statuses with marginal totals less than or equal to 10. Match clinical and mutation data using ParticipantBarcode.", "external_knowledge": null}
{"instance_id": "bq161", "db": "pancancer_atlas_2", "question": "Calculate the net difference between the number of pancreatic adenocarcinoma (PAAD) patients in TCGA's dataset who are confirmed to have mutations in both KRAS and TP53 genes, and those without mutations in either gene. Utilize patient clinical and follow-up data alongside genomic mutation details from TCGA’s cancer genomics database, focusing specifically on PAAD studies where the mutations have passed quality filters.", "external_knowledge": null}
{"instance_id": "bq151", "db": "pancancer_atlas_2", "question": "Using the TCGA dataset, calculate the chi-squared statistic to evaluate the association between KRAS and TP53 gene mutations in patients diagnosed with pancreatic adenocarcinoma (PAAD). Incorporate clinical follow-up data and high-quality mutation annotations to accurately determine the frequency of patients with co-occurring KRAS and TP53 mutations compared to those with each mutation occurring independently. Ensure that patient records are meticulously matched based on unique identifiers to maintain data integrity. This analysis aims to identify and quantify potential correlations between KRAS and TP53 genetic alterations within the PAAD patient population.", "external_knowledge": null}
{"instance_id": "bq162", "db": "HTAN_1", "question": "Based on the 5th revision (r5) of the HTAN data, list the imaging assay types available at the HTAN WUSTL center that have Level2 data and any associated higher-level data (Level3, Level4) derived from them through 'entityId' relationships in the 'id_provenance_r5' table; exclude any records where the 'Component' is NULL or contains 'Auxiliary' or 'OtherAssay'; for each imaging assay type, provide the available data levels (Level2, Level3, Level4), and do not include Level1 data or Electron Microscopy assay types.", "external_knowledge": null}
{"instance_id": "sf_bq163", "db": "HTAN_2", "question": "Which 20 genes exhibit the greatest difference in their average X_value expression between male and female epithelial cells, specifically in cluster 41 of MSK-SCLC patients at the 74-year-old human stage, comparing the female and male groups and ordering results by descending difference?", "external_knowledge": null}
{"instance_id": "sf_bq164", "db": "HTAN_2", "question": "Consolidate metadata from spatial transcriptomics and scRNAseq datasets—including levels 1 through 4 and auxiliary files—for the run ID 'HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test'. Include Filename, HTAN Parent Biospecimen ID, Component, File Format, Entity ID, and Run ID.", "external_knowledge": null}
{"instance_id": "sf_bq166", "db": "TCGA_MITELMAN", "question": "Using segment-level copy number data from the copy_number_segment_allelic_hg38_gdc_r23 dataset restricted to 'TCGA-KIRC' samples, merge these segments with the cytogenetic band definitions in 'CytoBands_hg38' to identify each sample’s maximum copy number per cytoband. Classify these maximum copy numbers into amplifications (>3), gains (=3), homozygous deletions (=0), heterozygous deletions (=1), or normal (=2), then calculate the frequency of each subtype out of the total number of distinct cases, and finally present these frequencies as percentages sorted by chromosome and cytoband.", "external_knowledge": "Comprehensive_Guide_to_Copy_Number_Variations_in_Cancer_Genomics.md"}
{"instance_id": "bq165", "db": "mitelman", "question": "Can you use CytoConverter genomic coordinates to calculate the frequency of chromosomal gains and losses across a cohort of breast cancer (morphology='3111') and adenocarcinoma (topology='0401') samples? Concretely, please include the number and frequency (2 decimals in percentage) of amplifications (gains of more than 1 copy), gains (1 extra copy), losses (1 copy) and homozygous deletions (loss of 2 copies) for each chromosomal band. And sort the result by the ordinal of each chromosome and the starting-ending base-pair position of each band in ascending order.", "external_knowledge": null}
{"instance_id": "bq169", "db": "mitelman", "question": "Retrieve distinct case and clone information for which, within the same clone, there is simultaneously a loss of genetic material on chromosome 13 between positions 48,303,751 and 48,481,890, a loss on chromosome 17 between positions 7,668,421 and 7,687,490, and a gain on chromosome 11 between positions 108,223,067 and 108,369,102. For each such clone, also return the chromosomal details for each of these regions (including chromosome number, start and end positions) and the corresponding karyotype information.", "external_knowledge": null}
{"instance_id": "bq111", "db": "mitelman", "question": "Could you compute, by chromosome, the Pearson correlation between the frequency of copy number aberrations (including amplifications, gains, losses, and deletions) from the Mitelman database for cases with morph = 3111 and topo = 0401, and those computed from TCGA data, returning correlation coefficients and corresponding p-values for each aberration type, ensuring only results with at least five matching records are shown?", "external_knowledge": "Correlations_between_Mitelman_and_TCGA_datasets.md"}
{"instance_id": "bq451", "db": "_1000_genomes", "question": "Extract genotype data for single nucleotide polymorphisms (SNPs) on chromosome X, excluding positions where the `start` value is between 59999 and 2699519 or between 154931042 and 155260559. For each sample, identify genotype calls where the genotype array has at least one allele. Classify each genotype call into one of the following categories: homozygous reference alleles (both alleles are 0), homozygous alternate alleles (both alleles are the same and greater than 0), or heterozygous alleles (alleles are different, or any allele is null, and at least one allele is greater than 0). Compute the total number of callable sites (the sum of all three genotype categories), the number of homozygous reference, homozygous alternate, and heterozygous genotype calls, the total number of single nucleotide variants (SNVs) as the sum of homozygous alternate and heterozygous genotype calls, the percentage of heterozygous genotype calls among all SNVs, and the percentage of homozygous alternate genotype calls among all SNVs. Output the sample ID along with these computed counts and percentages, and order the results by the percentage of heterozygous genotype calls among SNVs in descending order, then by sample ID.", "external_knowledge": "1000_genomes_alleles_type.md"}
{"instance_id": "bq452", "db": "_1000_genomes", "question": "Identify variants on chromosome 12 and, for each variant, calculate the chi-squared score using allele counts in cases and controls, where cases are individuals from the 'EAS' super population and controls are individuals from all other super populations. Apply Yates's correction for continuity in the chi-squared calculation, ensuring that the expected counts for each allele in both groups are at least 5. Return the start position, end position, and chi-squared score of the top variants where the chi-squared score is no less than 29.71679.", "external_knowledge": null}
{"instance_id": "bq453", "db": "_1000_genomes", "question": "In chromosome 17 between positions 41196311 and 41277499, what are the reference names, start and end positions, reference bases, distinct alternate bases, variant types, and the chi-squared scores (calculated from Hardy-Weinberg equilibrium) along with the total number of genotypes, their observed and expected counts for homozygous reference, heterozygous, and homozygous alternate genotypes, as well as allele frequencies (including those from 1KG), for each variant?", "external_knowledge": null}
{"instance_id": "bq454", "db": "_1000_genomes", "question": "Determine, for each super population, the number of common autosomal variants (with an allele frequency of at least 0.05) grouped by the number of samples within that super population that possess the variant. Include the total population size for each super population, the variant types, and the sample counts in your analysis. Exclude sex chromosomes (X, Y, MT) from the analysis.", "external_knowledge": null}
{"instance_id": "sf_bq415", "db": "HUMAN_GENOME_VARIANTS", "question": "List the top 10 samples in the genome data that have the highest number of positions where there is exactly one alternate allele and the sample's genotype is homozygous for the reference allele (both alleles are 0). Order the results in descending order of these counts.", "external_knowledge": "Homozygous_Reference_Genotype.md"}
{"instance_id": "bq279", "db": "austin", "question": "Can you provide the number of distinct active and closed bike share stations for each year 2013 and 2014?", "external_knowledge": null}
{"instance_id": "bq281", "db": "austin", "question": "What is the highest number of electric bike rides lasting more than 10 minutes taken by subscribers with 'Student Membership' in a single day, excluding rides starting or ending at 'Mobile Station' or 'Repair Shop'?", "external_knowledge": null}
{"instance_id": "bq282", "db": "austin", "question": "Can you tell me the numeric value of the active council district in Austin which has the highest number of bike trips that start and end within the same district, but not at the same station?", "external_knowledge": null}
{"instance_id": "bq006", "db": "austin", "question": "What is the date with the second highest Z-score for daily counts of 'PUBLIC INTOXICATION' incidents in Austin for the year 2016? List the date in the format of '2016-xx-xx'.", "external_knowledge": null}
{"instance_id": "sf_bq283", "db": "AUSTIN", "question": "Among all stations that are currently active, identify those that rank in the top 15 (including ties) based on the total number of trips that start at each station. For each of these stations, return the station ID, the total number of starting trips, the percentage of those trips out of the overall starting trips from active stations, and the average trip duration in minutes. Order the results by the station’s rank.", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "bq284", "db": "bbc", "question": "Can you provide a breakdown of the total number of articles into different categories and the percentage of those articles that mention \"education\" within each category from the BBC News?", "external_knowledge": null}
{"instance_id": "bq413", "db": "dimensions_ai_covid19", "question": "Retrieve the venue titles of publications that have a `date_inserted` from the year 2021 onwards and are associated with a grid whose address city is 'Qianjiang'. For each publication, prioritize the venue title by selecting the journal title first if it exists; if not, then the proceedings title; if that's also unavailable, then the book title; and finally, if none of those are available, the book series title.", "external_knowledge": null}
{"instance_id": "bq425", "db": "ebi_chembl", "question": "Using data from ChEMBL Release 23, retrieve all distinct molecules associated with the company 'SanofiAventis,' listing the trade name and the most recent approval date for each molecule. Make sure to keep only the latest approval date per molecule and ensure the company field precisely matches 'SanofiAventis' without relying on other fields.", "external_knowledge": null}
{"instance_id": "bq430", "db": "ebi_chembl", "question": "Find pairs of different molecules tested in the same assay and standard type, where both have 10–15 heavy atoms, fewer than 5 activities in that assay, fewer than 2 duplicate activities, non-null standard values, and pChEMBL values over 10. For each pair, report the maximum heavy atom count, the latest publication date (calculated based on the document's rank within the same journal and year, and map it to a synthetic month and day), the highest document ID, classify the change in standard values as 'increase', 'decrease', or 'no-change' based on their values and relations, and generate UUIDs from their activity IDs and canonical SMILES.", "external_knowledge": "chembl.md"}
{"instance_id": "bq023", "db": "fec", "question": "Using the 2018 5-Year American Community Survey (ACS) for median incomes at the census tract level and the 2020 Federal Election Commission (FEC) individual contributions dataset filtered for donors in New York, matched to census tract geographies via a ZIP code to census tract crosswalk, calculate and list the average political donation amount and the median income for each census tract located in Kings County (Brooklyn), New York.", "external_knowledge": null}
{"instance_id": "bq094", "db": "fec", "question": "Please provide a list of all 2016 committees that supported at least one candidate and received a total amount of individual contributions between $0 and $200 (inclusive of more than $0 and less than $200) where these small-dollar contributions sum to more than $0 overall. For each qualifying committee, include its name, the number of unique candidates it supported, the candidates’ names in alphabetical order (separated by commas), and the total sum of these small-dollar donations received by the committee.", "external_knowledge": null}
{"instance_id": "bq287", "db": "fec", "question": "Among all Utah ZIP codes, what is the 2017 American Community Survey employment rate for the population aged 16 or older in the ZIP code that has the fewest FDIC-insured bank locations?", "external_knowledge": null}
{"instance_id": "bq432", "db": "fda", "question": "Please provide the food events data where both \"date_created\" and \"date_started\" are between January 1 and January 31, 2015, applying all data cleansing steps as specified in the cleansing documentation. This includes splitting the \"reactions\" and \"outcomes\" fields into arrays by commas, handling special numeric patterns in the \"products_brand_name\" field by appropriately splitting and replacing \", \" with \" -- \", replacing \", \" with \" -- \" in the \"products_industry_code\", \"products_role\", and \"products_industry_name\" fields, and calculating \"industry_code_length\" and \"brand_name_length\" as the lengths of the resulting arrays after splitting.", "external_knowledge": "Food_Event_Cleansing_Logic.md"}
{"instance_id": "bq285", "db": "fda", "question": "Could you provide me with the zip code of the location that has the highest number of bank institutions in Florida?", "external_knowledge": null}
{"instance_id": "bq288", "db": "fda", "question": "What is the total number of all banking institutions in the state that has the highest sum of assets from banks established between January 1, 1900, and December 31, 2000, with institution names starting with 'Bank'?", "external_knowledge": null}
{"instance_id": "sf_bq412", "db": "GOOGLE_ADS", "question": "Please retrieve the page URLs, first shown time, last shown time, removal reason, violation category, and the lower and upper bounds of times shown for the five most recently removed ads in the Croatia region (region code 'HR'), where the times shown availability date is null, the times shown lower bound exceeds 10,000, the times shown upper bound is below 25,000, and the ads used at least one non-unused audience selection approach among demographics, geographic location, contextual signals, customer lists, or topics of interest, ordering the resulting ads by their last shown time in descending order.", "external_knowledge": null}
{"instance_id": "sf_bq423", "db": "GOOGLE_ADS", "question": "Between January 1, 2023, and January 1, 2024, which image-type advertisement on the topic of Health, published by a verified advertiser located in Cyprus, was shown in Croatia, has times_shown_availability_date as NULL (meaning the times shown data is available), utilized demographic information, geo-location targeting, contextual signals, customer lists, and topics of interest without any of these selection methods being unused, and additionally had its first shown date strictly after January 1, 2023, and last shown date strictly before January 1, 2024? Among such ads, provide the page URL of the one with the highest upper bound of times shown.", "external_knowledge": null}
{"instance_id": "sf_bq070", "db": "IDC", "question": "Could you provide a clean, structured dataset from dicom_all table that only includes SM images marked as VOLUME from the TCGA-LUAD and TCGA-LUSC collections, excluding any slides with compression type “other,” where the specimen preparation step explicitly has “Embedding medium” set to “Tissue freezing medium,” and ensuring that the tissue type is only “normal” or “tumor” and the cancer subtype is reported accordingly?", "external_knowledge": "dicom_dataset_selection.md"}
{"instance_id": "sf_bq320", "db": "IDC", "question": "In the dicom_pivot table, how many unique StudyInstanceUID values exactly match the SegmentedPropertyTypeCodeSequence of \"15825003\" (case-insensitive) and also have a collection_id of either \"Community\" or \"nsclc_radiomics\"?", "external_knowledge": null}
{"instance_id": "sf_bq321", "db": "IDC", "question": "How many unique StudyInstanceUIDs are there from the DWI, T2 Weighted Axial, Apparent Diffusion Coefficient series, and T2 Weighted Axial Segmentations in the 'qin_prostate_repeatability' collection?", "external_knowledge": null}
{"instance_id": "sf_bq323", "db": "IDC", "question": "Within the 'prostatex' collection, for MRI sequences where the Modality is 'MR', assign the label 't2w_prostateX' to sequences whose SeriesDescription contains 't2_tse_tra' and 'adc_prostateX' to sequences whose SeriesDescription contains 'ADC'. For all sequences labeled as 't2w_prostateX' or 'adc_prostateX', calculate the average Repetition Time, the average Echo Time, and the average Slice Thickness, and then compute the sum of these averages to obtain the combined overall average.", "external_knowledge": null}
{"instance_id": "sf_bq417", "db": "IDC", "question": "Please provide identification details, study and series information, storage location, and total size in MB for the medical images belonging to male patients who are exactly 18 years old based on the numeric portion of the PatientAge field, where the BodyPartExamined is set to 'MEDIASTINUM' and the study date is strictly after September 1, 2014.", "external_knowledge": "IDC_data_model.md"}
{"instance_id": "sf_bq455", "db": "IDC", "question": "Identify the top five CT scan series by size (in MiB), including their SeriesInstanceUID, series number, patient ID, and series size. These series must be from the CT modality and not part of the 'nlst' collection. Exclude any series where the ImageType is classified as 'LOCALIZER' or where the TransferSyntaxUID is either '1.2.840.10008.1.2.4.70' or '1.2.840.10008.1.2.4.51' (i.e., JPEG compressed). The selected series must have consistent slice intervals, exposure levels, image orientation (with only one unique ImageOrientationPatient value), pixel spacing, image positions (both z-axis and xy positions), and pixel dimensions (rows and columns). Ensure that the number of images matches the number of unique z-axis positions, indicating no duplicate slices. Additionally, the z-axis component of the cross product of the x and y direction cosines from ImageOrientationPatient must have an absolute value between 0.99 and 1.01, ensuring alignment with the expected imaging plane. Finally, order the results by series size in descending order and limit the output to the top five series satisfying these conditions.", "external_knowledge": null}
{"instance_id": "sf_bq456", "db": "IDC", "question": "Please retrieve from the dicom_all table each PatientID, StudyInstanceUID, StudyDate, and the CodeMeaning of the FindingSite for patients whose StudyDate is in the year 2001, along with the maximum values of each of the following measurements identified by their CodeMeaning (Elongation, Flatness, Least Axis in 3D Length, Major Axis in 3D Length, Maximum 3D Diameter of a Mesh, Minor Axis in 3D Length, Sphericity, Surface Area of Mesh, Surface to Volume Ratio, Volume from Voxel Summation, and Volume of Mesh), ensuring that the quantitative_measurements table is joined on segmentationInstanceUID matching the SOPInstanceUID in dicom_all, and grouping by PatientID, StudyInstanceUID, StudyDate, and FindingSite CodeMeaning.", "external_knowledge": null}
{"instance_id": "sf_bq324", "db": "IDC", "question": "How many frames in total are present across all whole slide microscopy images from the TCGA-BRCA collection that use the SM modality and include an eosin-based staining step in their SpecimenPreparationSequence?", "external_knowledge": null}
{"instance_id": "bq418", "db": "targetome_reactome", "question": "Determine which three lowest-level Reactome pathways (with TAS evidence) have the highest chi-squared statistics, considering only Homo sapiens targets associated with sorafenib under the conditions that the median assay value is ≤ 100 and both low and high assay values are ≤ 100 or null. For each of these three pathways, how many of these targets and non-targets lie within the pathway and outside it?", "external_knowledge": null}
{"instance_id": "bq330", "db": "fda", "question": "Which Colorado zip code has the highest concentration of bank locations per block group, based on the overlap between zip codes and block groups?", "external_knowledge": "overlap_ratio.md"}
{"instance_id": "bq398", "db": "world_bank", "question": "What are the top three debt indicators for Russia based on the highest debt values?", "external_knowledge": null}
{"instance_id": "bq399", "db": "world_bank", "question": "For each region, which high-income country had the highest average crude birth rate during the 1980s, and what was its corresponding average birth rate?", "external_knowledge": null}
{"instance_id": "bq457", "db": "libraries_io", "question": "Get details of repositories that use specific feature toggle libraries. For each repository, include the full name with owner, hosting platform type, size in bytes, primary programming language, fork source name (if any), last update timestamp, the artifact and library names of the feature toggle used, and the library's programming languages. Include repositories that depend on the specified feature toggle libraries, defined by their artifact names, library names, platforms, and languages.", "external_knowledge": "feature_toggle_libraries.md"}
{"instance_id": "bq227", "db": "london", "question": "Could you provide the annual percentage shares, rounded to two decimal places, of the top 5 minor crime categories from 2008 in London's total crimes, with each year displayed in one row?", "external_knowledge": null}
{"instance_id": "bq232", "db": "london", "question": "Could you provide the total number of 'Other Theft' incidents within the 'Theft and Handling' category for each year in the Westminster borough?", "external_knowledge": null}
{"instance_id": "bq228", "db": "london", "question": "Please provide a list of the top three major crime categories in the borough of Barking and Dagenham, along with the number of incidents in each category.", "external_knowledge": null}
{"instance_id": "bq229", "db": "open_images", "question": "Can you provide a count of how many image URLs are categorized as ‘cat’ (with label '/m/01yrx' and full confidence) and how many contain no such cat labels at all (categorized as ‘other’)?", "external_knowledge": null}
{"instance_id": "bq230", "db": "usda_nass_agriculture", "question": "Using the crops dataset, find the total 2022 production figures, measured in bushels, for corn from the 'FIELD CROPS' category and mushrooms from the 'HORTICULTURE' group for each U.S. state. Only include data rows where 'statisticcat_desc' is 'PRODUCTION', 'agg_level_desc' is 'STATE', 'value' is not null, and ensure that for corn the 'unit_desc' is 'BU'. Combine both results so that each state’s 2022 corn and mushroom totals are presented.", "external_knowledge": null}
{"instance_id": "bq326", "db": "world_bank", "question": "Based on the World Bank global population dataset and the World Bank health nutrition population dataset, how many countries experienced an increase of more than 1% from the previous year to 2018 in both their total population and per capita current health expenditure (PPP)?", "external_knowledge": null}
{"instance_id": "bq424", "db": "world_bank", "question": "List the top 10 countries with respect to the total amount of long-term external debt in descending order, excluding those without a specified region.", "external_knowledge": null}
{"instance_id": "bq327", "db": "world_bank", "question": "How many debt indicators for Russia have a value of 0, excluding NULL values?", "external_knowledge": null}
{"instance_id": "bq328", "db": "world_bank", "question": "Which region has the highest median GDP (constant 2015 US$) value?", "external_knowledge": null}
{"instance_id": "sf_bq331", "db": "META_KAGGLE", "question": "Find the top three users who have authored the first message in forum topics, ranked in descending order by their message scores, where a message score is defined as the number of distinct users who voted on that message. For each of these users, provide their username and the absolute difference between their message score and the average message score across all first messages in forum topics.", "external_knowledge": null}
{"instance_id": "sf_bq380", "db": "META_KAGGLE", "question": "Using the data from Forum Message Votes and Users tables, find the top three users who have received the most distinct upvotes on the Kaggle forum. For each of these users, list their username, the total number of distinct upvotes they have received (based on the ToUserId field), and the total number of distinct upvotes they have given (based on the FromUserId field), sorted by the number of upvotes received in descending order and restricted to only the top three results.", "external_knowledge": null}
{"instance_id": "sf_bq370", "db": "WIDE_WORLD_IMPORTERS", "question": "How many customers have orders and invoices that match at the line-item level and, when aggregated, result in each customer having an equal count of orders and invoices as well as an identical total value for the orders and invoices?", "external_knowledge": null}
{"instance_id": "sf_bq371", "db": "WIDE_WORLD_IMPORTERS", "question": "In the year 2013, considering each invoice’s total value as the product of unit price and quantity and grouping by the quarter (Q1, Q2, Q3, Q4) in which the invoice date occurs, what is the difference between the maximum and minimum average invoice values across these quarters?", "external_knowledge": null}
{"instance_id": "sf_bq372", "db": "WIDE_WORLD_IMPORTERS", "question": "Among all orders that do not appear in the invoice table, for each customer category calculate the maximum lost order value, then determine which customer category’s maximum lost order value is closest to the overall average of these maximum lost order values across all categories?", "external_knowledge": null}
{"instance_id": "sf_bq373", "db": "WIDE_WORLD_IMPORTERS", "question": "Using the invoice date to determine each month of the year 2014, and summing the total invoice line amounts for each customer across these months, what is the median of the resulting average monthly spending across all customers?", "external_knowledge": null}
{"instance_id": "bq393", "db": "hacker_news", "question": "Can you tell me the ID and corresponding month number of the user with the highest month number who became inactive after their last recorded activity month, considering data only up until September 10, 2024?", "external_knowledge": null}
{"instance_id": "bq403", "db": "irs_990", "question": "Which three years in 2012-2017 have the smallest absolute difference between median revenue and median functional expenses for organizations filing IRS 990 forms? Please output three years and respective differences.", "external_knowledge": null}
{"instance_id": "bq397", "db": "ecommerce", "question": "After removing any duplicate records from the rev_transactions dataset, identify each channel grouping that has transactions from more than one country. For each such channel grouping, find the country with the highest total number of transactions and report both the country name and the sum of transactions for that channel grouping.", "external_knowledge": null}
{"instance_id": "bq402", "db": "ecommerce", "question": "Calculate the conversion rate from unique visitors to purchasers by using data exclusively from the `web_analytics` table in the `data-to-insights.ecommerce` dataset. A visitor is defined as a unique `fullVisitorId` present in the table, while a purchaser is a visitor who has at least one transaction recorded (`totals.transactions` is not null). The conversion rate is computed by dividing the number of unique purchasers by the total number of unique visitors. Additionally, calculate the average number of transactions per purchaser, considering only those visitors who have made at least one transaction.", "external_knowledge": null}
{"instance_id": "sf_bq160", "db": "META_KAGGLE", "question": "Please provide the creation date, title, parent forum title, reply count, distinct user replies count, total upvotes, and total views for the earliest five forum topics that belong to any sub-forum under the parent forum titled \"General\". The reply count should be calculated as the total number of messages associated with each topic, and the distinct user replies count should be the number of unique users who have posted messages in the topic. The upvotes should be calculated as the total number of upvotes on all messages within each topic. If any values are missing or None, please treat them as zero", "external_knowledge": null}
{"instance_id": "sf_bq167", "db": "META_KAGGLE", "question": "Identify the pair of Kaggle users involved in ForumMessageVotes such that one user has given the other the greatest distinct number of upvotes, then also display how many upvotes that recipient returned. Present the usernames of both users, the total distinct upvotes one received from the other, and the upvotes they gave back, sorting by the highest received count and then by the highest given count, and show only the top result.", "external_knowledge": null}
{"instance_id": "sf_bq171", "db": "META_KAGGLE", "question": "Whose Forum message upvotes are closest to the average in 2019? If there’s a tie, tell me the one with the alphabetically first username.", "external_knowledge": null}
{"instance_id": "sf_bq118", "db": "DEATH", "question": "Among individuals identified as white, how much higher is the average number of deaths from ICD-10 codes whose descriptions contain the word “discharge” (specifically excluding “Urethral discharge,” “Discharge of firework,” and “Legal intervention involving firearm discharge”) compared to the average number of deaths from ICD-10 codes whose descriptions contain the word “vehicle,” when aggregated by age groups?", "external_knowledge": null}
{"instance_id": "sf_bq072", "db": "DEATH", "question": "Please provide, for each age from 12 through 18 (inclusive), the total number of deaths and the number of deaths among individuals identified as Black (based on race descriptions containing the word ‘black’), specifically for deaths associated with ICD-10 codes whose descriptions include the word ‘vehicle’ and for deaths associated with ICD-10 codes whose descriptions include the word ‘firearm.’ Use the EntityAxisConditions table to determine which ICD-10 codes were involved in each death, rather than joining ICD-10 code information directly on the death records.", "external_knowledge": null}
{"instance_id": "ga001", "db": "ga4", "question": "I want to know the preferences of customers who purchased the Google Navy Speckled Tee in December 2020. What other product was purchased with the highest total quantity alongside this item?", "external_knowledge": null}
{"instance_id": "ga002", "db": "ga4", "question": "Tell me the most purchased other products and their quantities by customers who bought the Google Red Speckled Tee each month for the three months starting from November 2020.", "external_knowledge": null}
{"instance_id": "ga003", "db": "firebase", "question": "I'm trying to evaluate which board types were most effective on September 15, 2018. Can you find out the average scores for each board type from the quick play mode completions on that day?", "external_knowledge": null}
{"instance_id": "ga004", "db": "ga4", "question": "Can you figure out the average difference in pageviews between users who bought something and those who didn’t in December 2020? Just label anyone who was involved in purchase events as a purchaser.", "external_knowledge": null}
{"instance_id": "ga008", "db": "ga4", "question": "Based on the logs from November 2020, calculate the total number of page views for each day, as well as the average number of page views per user on that day, restricted to users who had at least one page view on that day.", "external_knowledge": null}
{"instance_id": "ga017", "db": "ga4", "question": "How many distinct users viewed the most frequently visited page during January 2021?", "external_knowledge": null}
{"instance_id": "ga007", "db": "ga4", "question": "Please find out what percentage of the page views on January 2, 2021, were for PDP type pages.", "external_knowledge": "ga4_page_category.md"}
{"instance_id": "ga013", "db": "ga4", "question": "I want to know all the pages visited by user 1402138.5184246691 on January 2, 2021. Please show the names of these pages and adjust the names to PDP or PLP where necessary.", "external_knowledge": "ga4_page_category.md"}
{"instance_id": "ga018", "db": "ga4", "question": "On January 2nd, 2021, I want to determine the percentage of times users transition from a product list page (PLP) view to a product detail page (PDP) view within the same session, using only page_view events. Could you calculate how many PLP views eventually led to a PDP view in the same session on that date, and then provide the resulting percentage of PLP-to-PDP transitions?", "external_knowledge": "ga4_page_category.md"}
{"instance_id": "ga031", "db": "ga4", "question": "I want to know the user session conversion rate on January 2nd, 2021, using only 'page_view' events. The conversion rate should be calculated as the percentage of user visits that reached both the Home and Checkout Confirmation pages in one session, relative to those that landed on the Home page.", "external_knowledge": null}
{"instance_id": "ga032", "db": "ga4", "question": "Can you pull up the sequence of pages our customer 1362228 visited on January 28th 2021, linking them with '>>' between each page? I want to see their navigation flow through our site. Please refer to the docs to convert the corresponding page title to \"PDP\" or \"PLP\" if necessary and merge adjacent identical page titles into one.", "external_knowledge": "ga4_page_category.md"}
{"instance_id": "ga006", "db": "ga4", "question": "For the date range November 1–30, 2020, can you retrieve each user_pseudo_id and its average purchase revenue in USD per session for users who had more than one purchase session, considering only events with event_name='purchase' and a non-null ecommerce.purchase_revenue_in_usd, grouping sessions by the ga_session_id from event_params?", "external_knowledge": null}
{"instance_id": "ga009", "db": "ga4", "question": "Could you tell me the average number of engaged sessions per user for December 2020, counting only those sessions where the event parameter 'session_engaged' is equal to '1' and using 'user_pseudo_id' combined with the 'ga_session_id' to identify distinct sessions?", "external_knowledge": null}
{"instance_id": "ga010", "db": "ga4", "question": "Can you give me an overview of our website traffic for December 2020? I'm particularly interested in the channel with the fourth highest number of sessions.", "external_knowledge": "ga4_dimensions_and_metrics.md"}
{"instance_id": "ga014", "db": "ga4", "question": "Can you provide the total number of sessions for each traffic channel in December 2020, using the information from the 'event_params'?", "external_knowledge": "ga4_dimensions_and_metrics.md"}
{"instance_id": "ga011", "db": "ga4", "question": "What is the page with the second highest total page views, after cleaning up its URL (removing extra slashes) and extracting the correct page path, on the website 'shop.googlemerchandisestore.com' during December 2020?", "external_knowledge": null}
{"instance_id": "ga012", "db": "ga4", "question": "On November 30, 2020, identify the item category with the highest tax rate by dividing tax value in usd by purchase revenue in usd for purchase events, and then retrieve the transaction IDs, total item quantities, and both purchase revenue in usd and purchase revenue for those purchase events in that top-tax-rate category.", "external_knowledge": null}
{"instance_id": "ga019", "db": "firebase", "question": "Could you determine what percentage of users either did not uninstall our app within seven days or never uninstalled it after installing during August and September 2018?", "external_knowledge": null}
{"instance_id": "ga030", "db": "firebase", "question": "Can you group users by the week of their first session start, starting from July 2, 2018? For each group, calculate the retention rate in the fourth week (i.e., the percentage of users from the original group who returned in the fourth week after their first session). Please identify the cohort with the highest retention rate in the fourth week, and name the group by the Monday date of the cohort's first session week. Return the result in the format \"YYYY-MM-DD\".", "external_knowledge": "retention_rate.md"}
{"instance_id": "ga005", "db": "firebase", "question": "Conduct a weekly cohort analysis for user retention starting from July 9, 2018, grouping users by the week of their first session start event. Calculate the retention rates for each cohort in the subsequent two weeks. Specifically, measure the percentage of users from the original cohort who returned in week 2 (two weeks after their first session). Only include users whose first session start event occurred after July 9, 2018. The data is available up to October 2, 2018 (meaning the last cohort group is for the week of September 17, 2018). For this analysis, retention is calculated based on the first session start date, and cohorts are grouped weekly starting from the first session.", "external_knowledge": "retention_rate.md"}
{"instance_id": "ga028", "db": "firebase", "question": "Please perform a 7-day retention analysis for users who first started a session in the app during the week starting on July 2, 2018. For each week from Week 0 (the week of their first session) to Week 4, provide the total number of new users in Week 0 and the number of retained users for each subsequent week. Ensure that you only count events up to October 2, 2018, and group dates by Monday-based weeks.", "external_knowledge": "retention_rate.md"}
{"instance_id": "ga020", "db": "firebase", "question": "Which quickplay event type had the lowest user retention rate during the second week after their initial engagement, for users who first engaged between August 1 and August 15, 2018, as measured by the presence of session_start events?", "external_knowledge": "retention_rate.md"}
{"instance_id": "ga021", "db": "firebase", "question": "What is the retention rate for users two weeks after their initial quickplay event, calculated separately for each quickplay event type, within the period from July 2, 2018, to July 16, 2018? Please focus on users who started a session (session_start) during this period.", "external_knowledge": "retention_rate.md"}
{"instance_id": "ga022", "db": "firebase", "question": "Could you please help me get the weekly customer retention rate in September 2018 for new customers who first used our app (first_open event) within the first week starting from September 1st, 2018 (timezone in Shanghai)? The retention rates should cover the following weeks 1, 2, and 3 period after the initial use and display them in column format.", "external_knowledge": "retention_rate.md"}
{"instance_id": "ga025", "db": "firebase", "question": "For all users who first opened the app in September 2018 and then uninstalled within seven days, I want to know what percentage of them experienced an app crash (app_exception). The calculation should be done by converting the timestamps to dates first, and then calculating the days to uninstall based on the dates. Only users who uninstalled within 7 days and experienced a crash should be considered in the final percentage.", "external_knowledge": null}
{"instance_id": "local002", "db": "E_commerce", "question": "Can you calculate the 5-day symmetric moving average of predicted toy sales for December 5 to 8, 2018, using daily sales data from January 1, 2017, to August 29, 2018, with a simple linear regression model? Finally provide the sum of those four 5-day moving averages?", "external_knowledge": null}
{"instance_id": "local003", "db": "E_commerce", "question": "According to the RFM definition document, calculate the average sales per order for each customer within distinct RFM segments, considering only 'delivered' orders. Use the customer unique identifier. Clearly define how to calculate Recency based on the latest purchase timestamp and specify the criteria for classifying RFM segments. The average sales should be computed as the total spend divided by the total number of orders. Please analyze and report the differences in average sales across the RFM segments", "external_knowledge": "RFM.md"}
{"instance_id": "local004", "db": "E_commerce", "question": "Could you tell me the number of orders, average payment per order and customer lifespan in weeks of the 3 customers with the highest average payment per order, where the lifespan is calculated by subtracting the earliest purchase date from the latest purchase date in days, dividing by seven, and if the result is less than seven days, setting it to 1.0?", "external_knowledge": null}
{"instance_id": "local007", "db": "Baseball", "question": "Could you help me calculate the average single career span value in years for all baseball players? Please express the result as a float number. First, calculate the difference in years, months, and days between the debut and final game dates. For each player, the career span is computed as the sum of the absolute number of years, plus the absolute number of months divided by 12, plus the absolute number of days divided by 365. Round each part to two decimal places before summing. Finally, average the career spans and round the result to a float number.", "external_knowledge": null}
{"instance_id": "local008", "db": "Baseball", "question": "I would like to know the given names of baseball players who have achieved the highest value of games played, runs, hits, and home runs, with their corresponding score values.", "external_knowledge": null}
{"instance_id": "local009", "db": "Airlines", "question": "What is the distance of the longest route where Abakan is either the departure or destination city (in kilometers)?", "external_knowledge": "haversine_formula.md"}
{"instance_id": "local010", "db": "Airlines", "question": "Distribute all the unique city pairs into the distance ranges 0, 1000, 2000, 3000, 4000, 5000, and 6000+, based on their average distance of all routes between them. Then how many pairs are there in the distance range with the fewest unique city pairs?", "external_knowledge": "haversine_formula.md"}
{"instance_id": "local015", "db": "California_Traffic_Collision", "question": "Please calculate, separately for motorcycle collisions involving riders who were wearing helmets and those who were not wearing helmets, the percentage of motorcyclist fatalities per collision. For each group, compute this by dividing the total number of motorcyclist fatalities by the total number of collisions involving that group.", "external_knowledge": null}
{"instance_id": "local017", "db": "California_Traffic_Collision", "question": "In which year were the two most common causes of traffic accidents different from those in other years?", "external_knowledge": null}
{"instance_id": "local018", "db": "California_Traffic_Collision", "question": "For the primary collision factor violation category that was the most common cause of traffic accidents in 2021, how many percentage points did its share of annual road incidents in 2021 decrease compared to its share in 2011?", "external_knowledge": null}
{"instance_id": "local019", "db": "WWE", "question": "For the NXT title that had the shortest match (excluding titles with \"title change\"), what were the names of the two wrestlers involved?", "external_knowledge": null}
{"instance_id": "local026", "db": "IPL", "question": "Please help me identify the top 3 bowlers who, in the overs where the maximum runs were conceded in each match, gave up the highest number of runs in a single over across all matches. For each of these bowlers, provide the match in which they conceded these maximum runs. Only consider overs that had the most runs conceded within their respective matches, and among these, determine which bowlers conceded the most runs in a single over overall.", "external_knowledge": null}
{"instance_id": "local020", "db": "IPL", "question": "Which bowler has the lowest bowling average per wicket taken?", "external_knowledge": null}
{"instance_id": "local021", "db": "IPL", "question": "Could you calculate the average of the total runs scored by all strikers who have scored more than 50 runs in any single match?", "external_knowledge": null}
{"instance_id": "local022", "db": "IPL", "question": "Retrieve the names of players who scored no less than 100 runs in a match while playing for the team that lost that match.", "external_knowledge": null}
{"instance_id": "local023", "db": "IPL", "question": "Please help me find the names of top 5 players with the highest average runs per match in season 5, along with their batting averages.", "external_knowledge": null}
{"instance_id": "local024", "db": "IPL", "question": "Can you help me find the top 5 countries whose players have the highest average of their individual average runs per match across all seasons? Specifically, for each player, calculate their average runs per match over all matches they played, then compute the average of these player averages for each country, and include these country batting averages in the result.", "external_knowledge": null}
{"instance_id": "local025", "db": "IPL", "question": "For each match, considering every innings, please combine runs from both batsman scored and extra runs for each over, then identify the single over with the highest total runs, retrieve the bowler for that over from the ball by ball table, and calculate the average of these highest over totals across all matches, ensuring that all runs and bowler details are accurately reflected.", "external_knowledge": null}
{"instance_id": "local028", "db": "Brazilian_E_Commerce", "question": "Could you generate a report that shows the number of delivered orders for each month in the years 2016, 2017, and 2018? Each column represents a year, and each row represents a month.", "external_knowledge": null}
{"instance_id": "local031", "db": "Brazilian_E_Commerce", "question": "What is the highest monthly delivered orders volume in the year with the lowest annual delivered orders volume among 2016, 2017, and 2018?", "external_knowledge": null}
{"instance_id": "local029", "db": "Brazilian_E_Commerce", "question": "Please identify the top three customers, based on their customer_unique_id, who have the highest number of delivered orders, and provide the average payment value, city, and state for each of these customers.", "external_knowledge": null}
{"instance_id": "local030", "db": "Brazilian_E_Commerce", "question": "Among all cities with delivered orders, find the five cities whose summed payments are the lowest, then calculate the average of their total payments and the average of their total delivered order counts.", "external_knowledge": null}
{"instance_id": "local032", "db": "Brazilian_E_Commerce", "question": "Could you help me find the sellers respectively with the highest number of distinct customers, highest profit, highest number of distinct orders, and most 5-star ratings, in delivered orders, along with their corresponding values?", "external_knowledge": null}
{"instance_id": "local034", "db": "Brazilian_E_Commerce", "question": "Could you help me calculate the average of the total number of payments made using the most preferred payment method for each product category, where the most preferred payment method in a category is the one with the highest number of payments?", "external_knowledge": null}
{"instance_id": "local037", "db": "Brazilian_E_Commerce", "question": "Identify the top three product categories whose most commonly used payment type has the highest number of payments across all categories, and specify the number of payments made in each category using that payment type.", "external_knowledge": null}
{"instance_id": "local035", "db": "Brazilian_E_Commerce", "question": "In the “olist_geolocation” table, please identify which two consecutive cities, when sorted by geolocation_state, geolocation_city, geolocation_zip_code_prefix, geolocation_lat, and geolocation_lng, have the greatest distance between them based on the difference in distance computed between each city and its immediate predecessor in that ordering.", "external_knowledge": "spherical_law.md"}
{"instance_id": "local038", "db": "Pagila", "question": "Could you help me determine which actor starred most frequently in English-language children's category films that were rated either G or PG, had a running time of 120 minutes or less, and were released between 2000 and 2010? Please provide the actor's full name.", "external_knowledge": null}
{"instance_id": "local039", "db": "Pagila", "question": "Please help me find the film category with the highest total rental hours in cities where the city's name either starts with \"A\" or contains a hyphen.", "external_knowledge": null}
{"instance_id": "local040", "db": "modern_data", "question": "In the combined dataset that unifies the trees data with the income data by ZIP code, filling missing ZIP values where necessary, which three boroughs, restricted to records with median and mean income both greater than zero and a valid borough name, contain the highest number of trees, and what is the average mean income for each of these three boroughs?", "external_knowledge": null}
{"instance_id": "local041", "db": "modern_data", "question": "What percentage of trees in the Bronx have a health status of Good?", "external_knowledge": null}
{"instance_id": "local049", "db": "modern_data", "question": "Can you help me calculate the average number of new unicorn companies per year in the top industry from 2019 to 2021?", "external_knowledge": null}
{"instance_id": "local054", "db": "chinook", "question": "Could you tell me the first names of customers who spent less than $1 on albums by the best-selling artist, along with the amounts they spent?", "external_knowledge": null}
{"instance_id": "local055", "db": "chinook", "question": "Identify the artist with the highest overall sales of albums (tie broken by alphabetical order) and the artist with the lowest overall sales of albums (tie broken by alphabetical order), then calculate the amount each customer spent specifically on those two artists’ albums. Next, compute the average spending for the customers who purchased from the top-selling artist and the average spending for the customers who purchased from the lowest-selling artist, and finally return the absolute difference between these two averages.", "external_knowledge": null}
{"instance_id": "local198", "db": "chinook", "question": "Using the sales data, what is the median value of total sales made in countries where the number of customers is greater than 4?", "external_knowledge": null}
{"instance_id": "local056", "db": "sqlite-sakila", "question": "Which customer has the highest average monthly change in payment amounts? Provide the customer's full name.", "external_knowledge": null}
{"instance_id": "local058", "db": "education_business", "question": "Can you provide a list of hardware product segments along with their unique product counts for 2020 in the output, ordered by the highest percentage increase in unique fact sales products from 2020 to 2021?", "external_knowledge": null}
{"instance_id": "local059", "db": "education_business", "question": "For the calendar year 2021, what is the overall average quantity sold of the top three best-selling hardware products (by total quantity sold) in each division?", "external_knowledge": null}
{"instance_id": "local060", "db": "complex_oracle", "question": "In the United States, for Q4 2019 and Q4 2020, first select only those cities where total sales (with no promotions) rose by at least 20% from Q4 2019 to Q4 2020. Among these cities, rank products by their overall sales (still excluding promotions) in those quarters and take the top 20%. Then compute each top product’s share of total sales in Q4 2019 and Q4 2020 and calculate the difference in share from Q4 2019 to Q4 2020, returning the results in descending order of that share change.", "external_knowledge": null}
{"instance_id": "local063", "db": "complex_oracle", "question": "Among all products sold in the United States with promo_id=999, considering only those cities whose sales increased by at least 20% from Q4 2019 (calendar_quarter_id=1772) to Q4 2020 (calendar_quarter_id=1776), which product that ranks in the top 20% of total sales has the smallest percentage-point change in its share of total sales between these two quarters?", "external_knowledge": null}
{"instance_id": "local061", "db": "complex_oracle", "question": "What is the average projected monthly sales in USD for France in 2021, considering only product sales with promotions where promo_total_id = 1 and channels where channel_total_id = 1, by taking each product’s monthly sales from 2019 and 2020, calculating the growth rate from 2019 to 2020 for that same product and month, applying this growth rate to project 2021 monthly sales, converting all projected 2021 amounts to USD with the 2021 exchange rates, and finally averaging and listing them by month?", "external_knowledge": "projection_calculation.md"}
{"instance_id": "local050", "db": "complex_oracle", "question": "What is the median of the average monthly projected sales in USD for France in 2021, calculated by using the monthly sales data from 2019 and 2020 (filtered by promo_total_id=1 and channel_total_id=1), applying the growth rate from 2019 to 2020 to project 2021, converting to USD based on the currency table, and then determining the monthly averages before finding their median?", "external_knowledge": "projection_calculation.md"}
{"instance_id": "local062", "db": "complex_oracle", "question": "Please group all Italian customers into ten buckets for December 2021 by summing their profits from all products purchased (where profit is calculated as quantity_sold multiplied by the difference between unit_price and unit_cost), then divide the overall range of total monthly profits into ten equal intervals. For each bucket, provide the number of customers, and identify the minimum and maximum total profits within that bucket.", "external_knowledge": null}
{"instance_id": "local067", "db": "complex_oracle", "question": "Can you provide the highest and lowest profits for Italian customers segmented into ten evenly divided tiers based on their December 2021 sales profits?", "external_knowledge": null}
{"instance_id": "local070", "db": "city_legislation", "question": "Please examine our records for Chinese cities in July 2021 and identify both the shortest and longest streaks of consecutive date entries. List the dates along with their corresponding city names, capitalizing the first letter of each city name, for these streaks.", "external_knowledge": null}
{"instance_id": "local071", "db": "city_legislation", "question": "Could you review our records in June 2022 and identify which countries have the longest streak of consecutive inserted city dates? Please list the 2-letter country codes of these countries.", "external_knowledge": null}
{"instance_id": "local072", "db": "city_legislation", "question": "Identify the country with data inserted on nine different days in January 2022. Then, find the longest consecutive period with data insertions for this country during January 2022, and calculate the proportion of entries that are from its capital city within this longest consecutive insertion period.", "external_knowledge": null}
{"instance_id": "local068", "db": "city_legislation", "question": "Calculate the number of new cities inserted in April, May, and June for each year from 2021 to 2023. For each month, compute the cumulative running total of cities added for that specific month across the years up to and including the given year (i.e., sum the counts of that month over the years). Additionally, calculate the year-over-year growth percentages for both the monthly total and the running total for each month, comparing each year to the previous year. Present the results only for 2022 and 2023, listing the year, the month, the total number of cities added in that month, the cumulative running total for that month, and the year-over-year growth percentages for both the monthly total and the running total. Use the data from 2021 solely as a baseline for calculating growth rates, and exclude it from the final output.", "external_knowledge": null}
{"instance_id": "local073", "db": "modern_data", "question": "For each pizza order, please list a single result row containing the row ID, order ID, customer ID, pizza name, and the final set of ingredients. The final set of ingredients should be determined by taking the standard toppings from the pizza’s recipe, removing any toppings specified as exclusions, then adding any toppings specified as extras. The resulting ingredients must be presented in alphabetical order, prefixed by ‘2x’ if the same ingredient appears multiple times (for example, if both standard and extra or added multiple times), and concatenated into a string that begins with the pizza name followed by ‘: ’. Group by row ID, order ID, pizza name, and order time so that each order appears once, and return the listing of toppings in ascending order by row ID, ensuring that ‘Meatlovers’ pizzas receive a pizza_id of 1 while all others receive a pizza_id of 2.", "external_knowledge": null}
{"instance_id": "local066", "db": "modern_data", "question": "Based on our customer pizza order information, summarize the total quantity of each ingredient used in the pizzas we delivered. Output the name and quantity for each ingredient.", "external_knowledge": null}
{"instance_id": "local065", "db": "modern_data", "question": "Calculate the total income from Meat Lovers pizzas priced at $12 and Vegetarian pizzas at $10. Include any extra toppings charged at $1 each. Ensure that canceled orders are filtered out. How much money has Pizza Runner earned in total?", "external_knowledge": null}
{"instance_id": "local074", "db": "bank_sales_trading", "question": "Please generate a summary of the closing balances at the end of each month for each customer's transactions, showing the monthly changes and monthly cumulative bank account balances. Ensure that even if a customer has no account activity in a given month, the balance for that month is still included in the output.", "external_knowledge": null}
{"instance_id": "local064", "db": "bank_sales_trading", "question": "For each customer and each month of 2020, first calculate the month-end balance by adding all deposit amounts and subtracting all withdrawal amounts that occurred during that specific month. Then determine which month in 2020 has the highest count of customers with a positive month-end balance and which month has the lowest count. For each of these two months, compute the average month-end balance across all customers and provide the difference between these two averages.", "external_knowledge": null}
{"instance_id": "local297", "db": "bank_sales_trading", "question": "For each customer, group all deposits and withdrawals by the first day of each month to obtain a monthly net amount, then calculate each month’s closing balance by cumulatively summing these monthly nets. Next, determine the most recent month’s growth rate by comparing its closing balance to the prior month’s balance, treating deposits as positive and withdrawals as negative, and if the previous month’s balance is zero, the growth rate should be the current month’s balance multiplied by 100. Finally, compute the percentage of customers whose most recent month shows a growth rate of more than 5%.", "external_knowledge": null}
{"instance_id": "local298", "db": "bank_sales_trading", "question": "For each month, calculate the total balance from all users for the previous month (measured as of the 1st of each month), replacing any negative balances with zero. Ensure that data from the first month is used only as a baseline for calculating previous total balance, and exclude it from the final output. Sort the results in ascending order by month.", "external_knowledge": null}
{"instance_id": "local299", "db": "bank_sales_trading", "question": "Could you calculate each user’s average balance over the past 30 days, computed daily? Then, for each month (based on the 1st of each month), find the highest of these daily averages for each user. Add up these maximum values across all users for each month as the final result. Please use the first month as a baseline for previous balances and exclude it from the output.", "external_knowledge": null}
{"instance_id": "local300", "db": "bank_sales_trading", "question": "For each customer, calculate their daily balances for every day between their earliest and latest transaction dates, including days without transactions by carrying forward the previous day's balance. Treat any negative daily balances as zero. Then, for each month, determine the highest daily balance each customer had during that month. Finally, for each month, sum these maximum daily balances across all customers to obtain a monthly total.", "external_knowledge": null}
{"instance_id": "local075", "db": "bank_sales_trading", "question": "Can you provide a breakdown of how many times each product was viewed, how many times they were added to the shopping cart, and how many times they were left in the cart without being purchased? Also, give me the count of actual purchases for each product. Ensure that products with a page id in (1, 2, 12, 13) are filtered out.", "external_knowledge": null}
{"instance_id": "local077", "db": "bank_sales_trading", "question": "Please analyze our interest data from September 2018 to August 2019. For each month, calculate the average composition for each interest by dividing the composition by the index value. Identify the interest with the highest average composition value each month and report its average composition as the max index composition for that month. Compute the three-month rolling average of these monthly max index compositions. Ensure the output includes the date, the interest name, the max index composition for that month, the rolling average, and the names and max index compositions of the top interests from one month ago and two months ago.", "external_knowledge": null}
{"instance_id": "local078", "db": "bank_sales_trading", "question": "Identify the top 10 and bottom 10 interest categories based on their highest composition values across all months. For each category, display the time (MM-YYYY), the interest name, and the composition value.", "external_knowledge": null}
{"instance_id": "local081", "db": "northwind", "question": "Considering only the customers who placed orders in 1998, calculate the total amount each customer spent by summing the unit price multiplied by the quantity of all products in their orders, excluding any discounts. Assign each customer to a spending group based on the customer group thresholds, and determine how many customers are in each spending group and what percentage of the total number of customers who placed orders in 1998 each group represents.", "external_knowledge": null}
{"instance_id": "local085", "db": "northwind", "question": "Among employees who have more than 50 total orders, which three have the highest percentage of late orders, where an order is considered late if the shipped date is on or after its required date? Please list each employee's ID, the number of late orders, and the corresponding late-order percentage.", "external_knowledge": null}
{"instance_id": "local096", "db": "Db-IMDB", "question": "For each year, calculate the proportion of films that had exclusively female actors, considering actors with gender 'Male' or 'None' (i.e., unknown or unspecified gender) as non-female. Show the proportion of female-actor-only films and the total number of all films for each year.", "external_knowledge": null}
{"instance_id": "local097", "db": "Db-IMDB", "question": "Could you analyze our data and identify which ten-year period starting from any movie release year present in the data had the largest number of films, considering consecutive ten-year periods beginning at each unique year? Only output the start year and the total count for that specific period.", "external_knowledge": null}
{"instance_id": "local098", "db": "Db-IMDB", "question": "From the first year each actor appeared in a film to the last, how many actors in the database never had a gap longer than three consecutive years without at least one new movie appearance, meaning there is no four-year span anywhere in their active career without at least a single film credit?", "external_knowledge": null}
{"instance_id": "local099", "db": "Db-IMDB", "question": "I need you to look into the actor collaborations and tell me how many actors have made more films with Yash Chopra than with any other director. This will help us understand his influence on the industry better.", "external_knowledge": null}
{"instance_id": "local100", "db": "Db-IMDB", "question": "Find out how many actors have a 'Shahrukh number' of 2. This means they acted in a film with someone who acted with Shahrukh Khan, but not directly with him.", "external_knowledge": null}
{"instance_id": "local114", "db": "education_business", "question": "Provide a detailed web sales report for each region, including the number of orders, total sales amount, and the name and sales amount of all sales representatives who achieved the highest total sales amount in that region (include all representatives in case of a tie).", "external_knowledge": null}
{"instance_id": "local128", "db": "BowlingLeague", "question": "List the bowlers, match number, game number, handicap score, tournament date, and location for only those bowlers who won their game with a handicap score of 190 or less at Thunderbird Lanes, Totem Lanes, and Bolero Lanes.", "external_knowledge": null}
{"instance_id": "local130", "db": "school_scheduling", "question": "Could you provide a list of last names for all students who have completed English courses, including their quintile ranks based on their individual grades in those courses (without averaging), where the quintiles are determined by ranking students according to the number of students who have grades greater than or equal to theirs, and sorted from the highest to the lowest grade quintile?", "external_knowledge": null}
{"instance_id": "local131", "db": "EntertainmentAgency", "question": "Could you list each musical style with the number of times it appears as a 1st, 2nd, or 3rd preference in a single row per style?", "external_knowledge": null}
{"instance_id": "local133", "db": "EntertainmentAgency", "question": "Given a database of musical styles and user preferences, where each user ranks up to three preferred styles as their first, second, and third choices, assign 3 points for each first-choice ranking, 2 points for each second-choice ranking, and 1 point for each third-choice ranking. Calculate the total weighted score for each musical style that has been ranked by at least one user. Then, compute the absolute difference between each style's total weighted score and the average total weighted score across all such styles.", "external_knowledge": null}
{"instance_id": "local132", "db": "EntertainmentAgency", "question": "Show all pairs of entertainers and customers who each have up to three style strengths or preferences, where the first and second style preferences of the customers match the first and second style strengths of the entertainers (or in reverse order). Only return the entertainer’s stage name and the customer’s last name.", "external_knowledge": null}
{"instance_id": "local141", "db": "AdventureWorks", "question": "How did each salesperson's annual total sales compare to their annual sales quota? Provide the difference between their total sales and the quota for each year, organized by salesperson and year.", "external_knowledge": null}
{"instance_id": "local152", "db": "imdb_movies", "question": "Can you provide the top 9 directors by movie count, including their ID, name, number of movies, average inter-movie duration (rounded to the nearest integer), average rating (rounded to 2 decimals), total votes, minimum and maximum ratings, and total movie duration? Sort the output first by movie count in descending order and then by total movie duration in descending order.", "external_knowledge": null}
{"instance_id": "local230", "db": "imdb_movies", "question": "Determine the top three genres with the most movies rated above 8, and then identify the top four directors who have directed the most films rated above 8 within those genres. List these directors and their respective movie counts.", "external_knowledge": null}
{"instance_id": "local156", "db": "bank_sales_trading", "question": "Analyze the annual average purchase price per Bitcoin by region, computed as the total dollar amount spent divided by the total quantity purchased each year, excluding the first year's data for each region. Then, for each year, rank the regions based on these average purchase prices, and calculate the annual percentage change in cost for each region compared to the previous year.", "external_knowledge": null}
{"instance_id": "local157", "db": "bank_sales_trading", "question": "Using the \"bitcoin_prices\" table, please calculate the daily percentage change in trading volume for each ticker from August 1 to August 10, 2021, ensuring that any volume ending in \"K\" or \"M\" is accurately converted to thousands or millions, any \"-\" volume is treated as zero, only non-zero volumes are used to determine the previous day's volume, and the results are ordered by ticker and date.", "external_knowledge": null}
{"instance_id": "local163", "db": "education_business", "question": "Which university faculty members' salaries are closest to the average salary for their respective ranks? Please provide the ranks, first names, last names, and salaries.", "external_knowledge": null}
{"instance_id": "local168", "db": "city_legislation", "question": "Among job postings that specifically have the Data Analyst, require a non-null annual average salary, and are remote, what is the overall average salary when considering only the top three most frequently demanded skills for these positions?", "external_knowledge": null}
{"instance_id": "local169", "db": "city_legislation", "question": "What is the proportion of legislators who started their first term between 1917 and 1999 that remained in office on December 31st of each year, tracked annually for up to 20 years following their initial term start?", "external_knowledge": null}
{"instance_id": "local171", "db": "city_legislation", "question": "For male legislators from Louisiana, how many distinct legislators were actively serving on December 31 of each year from more than 30 years since their first term up to less than 50 years, grouping the results by the exact number of years elapsed since their first term?", "external_knowledge": null}
{"instance_id": "local167", "db": "city_legislation", "question": "Based on the state each female legislator first represented, which state has the highest number of female legislators whose terms included December 31st at any point, and what is that count? Please provide the state's abbreviation.", "external_knowledge": null}
{"instance_id": "local170", "db": "city_legislation", "question": "Identify the state abbreviations where, for both male and female legislators who began serving in that state, the retention rate of the initial cohort is greater than zero at each of the intervals 0, 2, 4, 6, 8, and 10 years after their initial term start date during the first 10 years of service.", "external_knowledge": null}
{"instance_id": "local193", "db": "sqlite-sakila", "question": "Could you find out the average percentage of the total lifetime sales (LTV) that occur in the first 7 and 30 days after a customer's initial purchase? Also, include the average total lifetime sales (LTV). Please exclude customers with zero lifetime sales. The 7- and 30-day periods should be based on the exact number of hours-minutes-seconds, not calendar days.", "external_knowledge": null}
{"instance_id": "local194", "db": "sqlite-sakila", "question": "Please provide a list of the top three revenue-generating films for each actor, along with the average revenue per actor in those films, calculated by dividing the total film revenue equally among the actors for each film.", "external_knowledge": null}
{"instance_id": "local195", "db": "sqlite-sakila", "question": "Please find out how widespread the appeal of our top five actors is. What percentage of our customers have rented films featuring these actors?", "external_knowledge": null}
{"instance_id": "local196", "db": "sqlite-sakila", "question": "For each rating category of the first movie rented by customers—where the first movie is identified based on the earliest payment date per customer—please provide the average total amount spent per customer and the average number of subsequent rentals (calculated as the total number of rentals minus one) for customers whose first rented movie falls into that rating category.", "external_knowledge": null}
{"instance_id": "local197", "db": "sqlite-sakila", "question": "Among our top 10 paying customers, can you identify the largest change in payment amounts from one month to the immediately following month? Specifically, please determine for which customer and during which month this maximum month-over-month difference occurred, and provide the difference rounded to two decimal places.", "external_knowledge": null}
{"instance_id": "local199", "db": "sqlite-sakila", "question": "Can you identify the year and month with the highest rental orders created by the store's staff for each store? Please list the store ID, the year, the month, and the total rentals for those dates.", "external_knowledge": null}
{"instance_id": "local201", "db": "modern_data", "question": "Identify the first 10 words, sorted alphabetically, that are 4 to 5 characters long, start with 'r', and have at least one anagram of the same length, considering case-sensitive letters. Provide the count of such anagrams for each word.", "external_knowledge": null}
{"instance_id": "local202", "db": "city_legislation", "question": "For alien data, how many of the top 10 states by alien population have a higher percentage of friendly aliens than hostile aliens, with an average alien age exceeding 200?", "external_knowledge": null}
{"instance_id": "local209", "db": "delivery_center", "question": "In the dataset of orders joined with store information, which store has the highest total number of orders, and among that store’s orders, what is the ratio of orders that appear in the deliveries table with a 'DELIVERED' status to the total orders for that store?", "external_knowledge": null}
{"instance_id": "local210", "db": "delivery_center", "question": "Can you identify the hubs that saw more than a 20% increase in finished orders from February to March?", "external_knowledge": null}
{"instance_id": "local212", "db": "delivery_center", "question": "Can you find 5 delivery drivers with the highest average number of daily deliveries?", "external_knowledge": null}
{"instance_id": "local218", "db": "EU_soccer", "question": "Can you calculate the median from the highest season goals of each team?", "external_knowledge": null}
{"instance_id": "local219", "db": "EU_soccer", "question": "In each league, considering all seasons, which single team has the fewest total match wins based on comparing home and away goals, including teams with zero wins, ensuring that if multiple teams tie for the fewest wins, only one team is returned for each league?", "external_knowledge": null}
{"instance_id": "local221", "db": "EU_soccer", "question": "Tell me the top 10 teams with the most wins across the league.", "external_knowledge": null}
{"instance_id": "local220", "db": "EU_soccer", "question": "Which player has participated in the highest number of winning matches and which player has participated in the highest number of losing matches, considering only matches where they actually played (excluding null entries) and where their team won or lost (excluding draws)?", "external_knowledge": null}
{"instance_id": "local228", "db": "IPL", "question": "For each IPL season, identify the top three batsmen with the highest total runs scored and the top three bowlers with the most wickets taken, excluding ‘run out’, ‘hit wicket’, and ‘retired hurt’ dismissals. In the event of ties in runs or wickets, break the tie using the smaller player ID. Then output these six players in matched positions—batsman 1 with bowler 1, batsman 2 with bowler 2, and batsman 3 with bowler 3—in ascending order of the season ID, along with each player’s total runs or wickets.", "external_knowledge": null}
{"instance_id": "local229", "db": "IPL", "question": "Find the IDs of players who scored the highest number of partnership runs for each match. The output should include the IDs of two players, each with their individual scores and the total partnership score. For each pair, the player with the higher individual score should be listed as player 1, and the player with the lower score as player 2. In cases where both players have the same score, the player with the higher ID should be player 1, and the player with the lower ID should be player 2. There can be multiple rows for a single match.", "external_knowledge": null}
{"instance_id": "local244", "db": "music", "question": "Calculate the duration of each track, classify them as short, medium, or long, output the minimum and maximum time for each kind (in minutes) and the total revenue for each category, group by the category.", "external_knowledge": "music_length_type.md"}
{"instance_id": "local253", "db": "education_business", "question": "Using a Salary Dataset where the salary values need to be cleaned by removing non-numeric characters and converting them to a numeric type, write a detailed SQL query that identifies the top 5 companies by average salary in each of Mumbai, Pune, New Delhi, and Hyderabad, then compares each company’s average salary in those cities to the overall national average salary. The final result should display four columns: Location, Company Name, Average Salary in State, and Average Salary in Country, listing only the top 5 companies in each of the specified locations.", "external_knowledge": null}
{"instance_id": "local258", "db": "IPL", "question": "Calculate the total number of wickets taken by each bowler (excluding run-outs and other dismissals not attributed to the bowler), their economy rate (total runs conceded divided by total overs bowled, considering only runs scored off the bat and ignoring any extra runs like wides and no-balls), their strike rate (average number of balls bowled per wicket taken), and their best bowling performance in a single match (the match with the most wickets taken by the bowler, formatted as \"wickets-runs\" where runs are the runs conceded excluding extras).", "external_knowledge": "baseball_game_special_words_definition.md"}
{"instance_id": "local259", "db": "IPL", "question": "For each player, list their ID, name, their most frequent role across all matches, batting hand, bowling skill, total runs scored, total matches played, total times they were dismissed, batting average (total runs divided by total dismissals), highest score in a single match, the number of matches in which they scored at least 30 runs, at least 50 runs, and at least 100 runs, total balls faced in their career, strike rate (total runs divided by total balls faced, multiplied by 100), total wickets taken, economy rate (average runs conceded per over), and their best bowling performance in a single match (most wickets taken in a match, formatted as \"wickets taken-runs given\", where the best performance is the one with the most wickets, and if tied, the fewest runs conceded). Ignore the extra runs data.", "external_knowledge": "baseball_game_special_words_definition.md"}
{"instance_id": "local262", "db": "stacking", "question": "Which problems exceed the total number of times they appear in the solution table when counting all occurrences, across steps 1, 2, and 3, where any non-\"Stack\" model's maximum test score is lower than the \"Stack\" model's test score for the same step and version?", "external_knowledge": null}
{"instance_id": "local263", "db": "stacking", "question": "Identify the L1_model associated with each model (specified by name and version) that occurs most frequently for each status ('strong' or 'soft'), along with the number of times it occurs. A model has a 'strong' status if, for any of its steps, the maximum test score among non-'Stack' models is less than the 'Stack' model's test score. It has a 'soft' status if the maximum test score among non-'Stack' models equals the 'Stack' model's test score. Count how many times each L1_model is associated with a 'strong' or 'soft' status across all models, and determine which L1_model has the highest occurrence for each status.", "external_knowledge": null}
{"instance_id": "local264", "db": "stacking", "question": "Which model category (L1_model) appears the most frequently across all steps and versions when comparing traditional models to the Stack model, and what is the total count of its occurrences?", "external_knowledge": null}
{"instance_id": "local269", "db": "oracle_sql", "question": "What is the average total quantity across all final packaging combinations, considering only the leaf-level items within each combination after fully expanding any nested packaging relationships?", "external_knowledge": null}
{"instance_id": "local270", "db": "oracle_sql", "question": "Which top-level packaging containers, meaning those not contained within any other packaging, have any item for which the total quantity accumulated across all nested levels in the hierarchy exceeds 500, and what are the names of both these containers and the corresponding items?", "external_knowledge": null}
{"instance_id": "local272", "db": "oracle_sql", "question": "For order 423, identify the product IDs, aisles, and positions from which to pick the exact quantities needed for each order line, ensuring that the total picked quantity for each product matches the cumulative quantities ordered without exceeding the available inventory in warehouse 1. Calculate the quantities to be picked from each location by prioritizing inventory with earlier purchased dates and smaller quantities, and ensure that picking respects the sequence and cumulative quantities of the order lines for products with multiple entries.", "external_knowledge": null}
{"instance_id": "local273", "db": "oracle_sql", "question": "Calculate the average pick percentage for each product name, using a first-in-first-out approach that selects from inventory locations based on the earliest purchase date and smallest available quantity, ensuring that the picked quantity reflects only the overlapping range between each order’s required quantity and the inventory’s available quantity, and then grouping and ordering the results by product name.", "external_knowledge": null}
{"instance_id": "local274", "db": "oracle_sql", "question": "Which products were picked for order 421, and what is the average number of units picked for each product, using FIFO (First-In, First-Out) method?", "external_knowledge": null}
{"instance_id": "local275", "db": "oracle_sql", "question": "Based on monthly sales data starting in January 2016 and using a centered moving average to adjust for seasonality, which products had a seasonality-adjusted sales ratio that stayed consistently above 2 for every month in the year 2017?", "external_knowledge": "calculation_method.md"}
{"instance_id": "local277", "db": "oracle_sql", "question": "What is the average forecasted annual sales for products 4160 and 7790 during 2018, using monthly sales data starting from January 2016 for the first 36 months, applying seasonality adjustments from time steps 7 through 30, and employing a weighted regression method to estimate sales?", "external_knowledge": "calculation_method.md"}
{"instance_id": "local279", "db": "oracle_sql", "question": "Using a recursive monthly inventory adjustment model starting from December 2018 inventory levels, where we restock a product if its ending inventory drops below the minimum required level, determine for each product the month in 2019 where the absolute difference between its ending inventory and the minimum required level is the smallest, and return the product_id, that month, and the absolute difference.", "external_knowledge": null}
{"instance_id": "local283", "db": "EU_soccer", "question": "Analyze the soccer match dataset to determine the champion team for each season across all countries and leagues, awarding 3 points for every win, 1 point for every tie, and 0 points for every loss. For each season, return the champion’s team name, the league, the country, and the total points accumulated.", "external_knowledge": null}
{"instance_id": "local284", "db": "bank_sales_trading", "question": "For veg whsle data, can you generate a summary of our items' loss rates? Include the average loss rate, and also break down the count of items that are below, above, and within one standard deviation from this average.", "external_knowledge": null}
{"instance_id": "local285", "db": "bank_sales_trading", "question": "For veg whsle data, can you analyze our financial performance over the years 2020 to 2023? I need insights into the average wholesale price, maximum wholesale price, minimum wholesale price, wholesale price difference, total wholesale price, total selling price, average loss rate, total loss, and profit for each category within each year. Round all calculated values to two decimal places.", "external_knowledge": null}
{"instance_id": "local286", "db": "electronic_sales", "question": "Prepare a comprehensive performance report on our sellers, focusing on total sales, average item price, average review scores, and packing times. Ensure that the report includes only those sellers who have sold a quantity of more than 100 products and highlight the product category names in English with the highest sales volume.", "external_knowledge": null}
{"instance_id": "local301", "db": "bank_sales_trading", "question": "For weekly-sales data, I need an analysis of our sales performance around mid-June for the years 2018, 2019, and 2020. Specifically, calculate the percentage change in sales between the four weeks leading up to June 15 and the four weeks following June 15 for each year.", "external_knowledge": null}
{"instance_id": "local302", "db": "bank_sales_trading", "question": "Analyze the average percentage change in sales between the 12 weeks before and after June 15, 2020, for each attribute type: region, platform, age band, demographic, and customer type. For each attribute type, calculate the average percentage change in sales across all its attribute values. Identify the attribute type with the highest negative impact on sales and provide its average percentage change in sales.", "external_knowledge": null}
{"instance_id": "local329", "db": "log", "question": "How many unique sessions visited the /regist/input page and then the /regist/confirm page, in that order?", "external_knowledge": null}
{"instance_id": "local330", "db": "log", "question": "Using only the data from the log table, compute for each web page the number of unique user sessions where that page is either the first landing page or the exit page of the session based on the timestamp.", "external_knowledge": null}
{"instance_id": "local331", "db": "log", "question": "Which three distinct third-page visits are most frequently observed immediately after two consecutive visits to the '/detail' page, and how many times does each third-page visit occur?", "external_knowledge": null}
{"instance_id": "local358", "db": "log", "question": "How many users are there in each age category (20s, 30s, 40s, 50s, and others)?", "external_knowledge": null}
{"instance_id": "local360", "db": "log", "question": "Identify the sessions with the minimal number of events occurring before the first '/detail' click or '/complete' conversion within the session, considering only events with non-empty search types. If multiple sessions share this minimum count, include all of them. For each session, display the associated paths and search types.", "external_knowledge": null}
{"instance_id": "local344", "db": "f1", "question": "Considering all races where pit stop data is available, and focusing on instances when a driver was not behind another car on the previous lap but is behind on the current lap (accounting for retirements, pit-stop entries, pit-stop exits, and race starts), how many times has each type of overtake occurred in Formula 1?", "external_knowledge": "f1_overtake.md"}
{"instance_id": "local336", "db": "f1", "question": "In the first five laps of the race, how many overtakes occurred in each category—retirements, pit stops, start-related overtakes, and standard on-track passes?", "external_knowledge": "f1_overtake.md"}
{"instance_id": "local335", "db": "f1", "question": "In Formula 1 seasons since 2001, considering only drivers who scored points in a season, which five constructors have had the most seasons where their drivers scored the fewest total points among all point-scoring drivers in that season?", "external_knowledge": null}
{"instance_id": "local309", "db": "f1", "question": "For each year, which driver and which constructor scored the most points? I want the full name of each driver.", "external_knowledge": null}
{"instance_id": "local310", "db": "f1", "question": "Using only the data from the ‘results’ table, find the three years in which the sum of the highest total points earned by any driver and the highest total points earned by any constructor in that year (both calculated by summing up points from the ‘results’ table) is smallest, and list those three years in order of ascending total.", "external_knowledge": null}
{"instance_id": "local311", "db": "f1", "question": "Which constructors had the top 3 combined points from their best driver and team, and in which years did they achieve them?", "external_knowledge": null}
{"instance_id": "local354", "db": "f1", "question": "Among Formula 1 drivers who raced during the 1950s, which drivers completed a season in that decade with the same constructor in both the first and the last race they participated in, while also taking part in at least two distinct race rounds during that season?", "external_knowledge": null}
{"instance_id": "local355", "db": "f1", "question": "Calculate the average first and last rounds of races missed by drivers each year. Only include drivers who missed fewer than three races annually and who switched teams between the race immediately before their first missed race and the race immediately after their last missed race.", "external_knowledge": null}
{"instance_id": "local356", "db": "f1", "question": "Provide the full names of drivers who have been overtaken on track more times than they have overtaken others on track during race laps, excluding position changes due to pit stops, retirements, or movements at the start of the race.", "external_knowledge": null}
{"instance_id": "sf001", "db": "GLOBAL_WEATHER__CLIMATE_DATA_FOR_BI", "question": "Assuming today is April 1, 2024, I would like to know the daily snowfall amounts greater than 6 inches for each U.S. postal code during the week ending after the first two full weeks of the previous year. Show the postal code, date, and snowfall amount.", "external_knowledge": null}
{"instance_id": "sf003", "db": "GLOBAL_GOVERNMENT", "question": "From 2015 to 2020, which zip code in Census Zip Code Tabulation Areas had the second-highest annual population growth rate, given a minimum estimate of 25,000 people over a 5-year period? Include the zip code, state abbreviation, and growth rate.", "external_knowledge": null}
{"instance_id": "sf002", "db": "FINANCE__ECONOMICS", "question": "As of December 31, 2022, list the top 10 active banks with assets exceeding $10 billion, ranked by the highest percentage of uninsured assets, where the percentage is calculated as one minus the value of the '% Insured (Estimated)' variable from quarterly estimates. Provide the names of these banks and their respective percentages of uninsured assets.", "external_knowledge": null}
{"instance_id": "sf044", "db": "FINANCE__ECONOMICS", "question": "What was the percentage change in post-market close prices for the Magnificent 7 tech companies from January 1 to June 30, 2024?", "external_knowledge": null}
{"instance_id": "sf006", "db": "FINANCE__ECONOMICS", "question": "For each U.S. state, find how the number of active financial branch entities has changed from March 1, 2020, to December 31, 2021. An entity is considered active on a specific date if its start date is on or before that date and its end date is either null or on or after that date. For each state, calculate the number of entities active on March 1, 2020, the number of entities active on December 31, 2021, and the percentage change in these counts.", "external_knowledge": null}
{"instance_id": "sf008", "db": "US_REAL_ESTATE", "question": "Determine the percentage change in gross income inflow and the seasonally-adjusted purchase-only home price index for the Phoenix-Mesa-Scottsdale, AZ Metro Area from January 1, 2023, to December 31, 2023. Gross income inflow refers to the total adjusted gross income from all financial entities within the specified metro area.", "external_knowledge": null}
{"instance_id": "sf010", "db": "US_REAL_ESTATE", "question": "What are the cumulative ratios of mortgages near default in California for each recorded date in 2023, including those that are 90 to 180 days past due, in forbearance, or undergoing foreclosure, bankruptcy, or deed-in-lieu processes?", "external_knowledge": null}
{"instance_id": "sf037", "db": "US_REAL_ESTATE", "question": "How can we calculate the shortest straight-line distance in miles between each 'The Home Depot' store, identified by its POI ID, and its nearest 'Lowe's Home Improvement' store using the geographic coordinates and ensuring we only return the minimal distance for each 'The Home Depot' location?", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "sf012", "db": "WEATHER__ENVIRONMENT", "question": "For each year from 2010 through 2019, what were the total building damage amounts and total contents damage amounts reported under the National Flood Insurance Program for the NFIP community named 'City Of New York,' grouped by each year of loss?", "external_knowledge": null}
{"instance_id": "sf018", "db": "BRAZE_USER_EVENT_DEMO_DATASET", "question": "Examine user engagement with push notifications within a specified one-hour window on June 1, 2023.", "temporal": "Yes", "external_knowledge": "PushNotificationAnalysis.md"}
{"instance_id": "sf035", "db": "BRAZE_USER_EVENT_DEMO_DATASET", "question": "How many unique users started sessions each day within each app group between June 1, 2023, and June 7, 2023? Also show the app group ID and the start day of the session.", "external_knowledge": null}
{"instance_id": "sf029", "db": "AMAZON_VENDOR_ANALYTICS__SAMPLE_DATASET", "question": "Generate a daily detailed sales report for each product under the 'Manufacturing' distributor view, covering the 30 days leading up to February 6, 2022, by joining the sales, traffic, inventory, and net PPM data on date, ASIN, program, period, and distributor_view. The report must include total ordered units, ordered revenue, average selling price, glance views, conversion rate, shipped units, shipped revenue, average net PPM, average procurable product OOS, total on-hand units and value, net received units and value, open purchase order quantities, unfilled customer ordered units, and average vendor confirmation rate, receive fill rate, sell-through rate, and vendor lead time.", "external_knowledge": null}
{"instance_id": "sf040", "db": "US_ADDRESSES__POI", "question": "Find the top 10 northernmost addresses in Florida's largest zip code area. What are their address numbers, street names, and types?", "temporal": "Yes", "external_knowledge": null}
{"instance_id": "sf009", "db": "NETHERLANDS_OPEN_MAP_DATA", "question": "A real estate company is looking for a comparison of the building types in Amsterdam and Rotterdam. They need to know the total surface area and the number of buildings for each type of building in both cities. Can you provide the building class and subclass, along with the total surface area and the number of buildings for both Amsterdam and Rotterdam?", "external_knowledge": null}
{"instance_id": "sf013", "db": "NETHERLANDS_OPEN_MAP_DATA", "question": "Determine the total length of roads for each class and subclass in Amsterdam and Rotterdam, based on specific QUADKEY segments '12020210' and '12020211'. Show the class, subclass, and total road lengths for both cities.", "external_knowledge": null}
{"instance_id": "sf041", "db": "YES_ENERGY__SAMPLE_DATA", "question": "Produce a report for ERCOT on October 1, 2022, that combines hourly data on day-ahead and real-time prices from node ID 10000697078, load forecasts (datatypeid 19060) and actual loads, plus wind (forecast datatypeid 9285, actual datatypeid 16) and solar (forecast datatypeid 662, actual datatypeid 650) generation forecasts and actuals from object ID 10000712973. This report should include time zone alignments, peak classifications, and net load calculations, providing insights into daily operational dynamics and efficiency.", "temporal": "Yes", "external_knowledge": "ERCOT_Daily_Market_Dynamics_Report.md"}
{"instance_id": "sf011", "db": "CENSUS_GALAXY__ZIP_CODE_TO_BLOCK_GROUP_SAMPLE", "question": "Determine the population distribution within each block group relative to its census tract in New York State using 2021 ACS data. Include block group ID, census value, state county tract ID, total tract population, and the population ratio of each block group.", "external_knowledge": null}
{"instance_id": "sf014", "db": "CENSUS_GALAXY__AIML_MODEL_DATA_ENRICHMENT_SAMPLE", "question": "What is the New York State ZIP code with the highest number of commuters traveling over one hour, according to 2021 ACS data? Include the zip code, the total commuters, state benchmark for this duration, and state population.", "external_knowledge": null}