Extraction often only gets some of the data #12

Open
simonw opened this issue Mar 7, 2024 · 2 comments
Labels: documentation (Improvements or additions to documentation)

simonw commented Mar 7, 2024

I'm testing with data from this page: https://ogs.ny.gov/procurement/ogs-centralized-contracts-list

I pasted in this:

Award #	Group	Award Title	Type	Keywords
23295	20915	Furniture, All Types (Except Hospital Room and Patient Handling) (Statewide)	Commodity	Conference Furniture, Dormitory Furniture, General Purpose Tables, High Density Filing, Household Furniture, Library Furniture, Office Furniture, School Furniture, Specialty Seating, Tall Seating, Bariatric, Gang Seating, Laboratory Stools, Systems Furniture
23287	05500	Fuel Oil, Heating (Grades #2, #6 Kerosene and Bioheating Fuel) (Statewide)	Commodity	Fuel oil, Heating, Kerosene, Bioheating Fuel, Heating
23321	05900	Natural Gas (Firm Supply - Specific Locations Within National Grid Territories)	Commodity	Natural Gas
23283	05800	Liquefied Petroleum Gas (LPG) - Propane (Statewide)	Commodity	Cylinders, Gallons, Tanks, Installation, Testing, Inspections, LP, Liquid Propane, Butane, Isobutene
23315	01800	Road Salt, Treated Salt, & Emergency Standby Road Salt (Statewide)	Commodity	Ice, Snow, Sodium, Chloride
23272	50030 	Wove & Kraft Envelopes	Commodity	Printed Envelopes, Non-Printed Envelopes
23254	40524	School Buses (Statewide)	Commodity	Bus, Conventional Bus
23241	10201	Pharmaceuticals (Individual Prescriptions) Statewide & Regional	Commodity	Drugs, Pharmacists Services, Prescription Delivery, Over the Counter, OTC, Pharmaceutical Products, Medicine, Medication
23200	20600	Floor Coverings and Related Services (Statewide Piggyback)	Commodity	Carpet, Tile, Broadloom, Vinyl, LVT, Rubber Tile, Hardwood, Linoleum, Floormat, Ceramic, Installation, Padding
23238	79006	Air Travel Services (Statewide)	Commodity	Plane Travel
23222	10150	Personal Protective Equipment (PPE) and Related Items (Statewide)	Commodity	Respirators, Masks, Face Shields, Goggles, Gowns, Covers, Hand Sanitizer, Wipes, Fit Test Kits, N95, Disinfecting Wipes, Surgical Mask, Alcohol Wipes, PPE
23239	01600	Milk, Fluid (Statewide)	Commodity	Low Fat Milk, Reduced Fat Milk, Skim Milk, Homogenized Milk
23073	30204	Athletic Equipment (Statewide)	Commodity	Gymnasium Equipment, Physical Education Equipment, Fitness, Exercise, Elliptical, Bike, Barbell, Dumbbell, Bench, Cardiovascular, Strength Training, Stairclimber, Treadmill, Weights, Mats
23123	30310	Vehicle and Equipment Parts and Related Product (Statewide)	Commodity	Light Duty Vehicle Parts, Heavy Duty Vehicle Parts, Heavy Equipment Parts, Direct Order Parts, Commonly Stocked Parts, Vehicle Cleaning Supplies, Vehicle Paint, Vehicle Tools
23204	05700	Motor Oil, Hydraulic Oil, and Diesel Exhaust Fluid (Statewide)  (Replaces 23012-RA, SW)	Commodity	Motor Crankcase Oil, Hydraulic Oil, Diesel Exhaust Fluid, Refined Oil, Re-Refined Oil, Lubricating Oil, High Detergent, 5W-30, 5W-20, 10W-30, 15W-40
23149	30600	Tires, Tubes, and Services (Statewide)	Commodity	 
PGB-23243	35000	Vehicle Lifts and Associated Garage Equipment Sourcewell Piggyback (Statewide)	Commodity	Garage Associated Parts, Garage Associated Supplies, Garage Associated Accessories, Vehicle Lift Installation, Vehicle Lift Repair, Vehicle Lift Maintenance
23166	40440	Vehicles, Class 1 – 8 (Statewide)	Commodity	Single OEM Vehicles, Chassis, Complete Vehicles, Car, Truck, SUV, Van, Sedan, One-ton Truck, Cargo Van
23170	40523	Buses, Transit (Adult Passenger) (Statewide)	Commodity	FTA Adult Passenger Transit Buses, Associated Transit Bus Equipment
23260	20070	Books, Serials, Databases, and Library Resource Management Products	Commodity	Serials, Databases, Library Resource Management Products and Services, Printed Publications, Non-Print Library Materials, Electronic Publications, Research Support Products, Printed Periodicals, Electronic Periodicals, eBooks, Streaming Audio, Video Content, Magazines, Newspapers, Journals, Legal Research, Books, Textbooks
23185	23106	STEM / STEAM, Science Laboratory Educational Supplies And Equipment (Statewide)	Commodity	 
23268	32100	Snow and Ice Control Agents (Statewide)	Commodity	Liquid, Calcium, Chloride, Organic, Based Performance Enhancer, OBPE, Magnesium, Corrosion, Inhibited, Treated, Salt, Flake, Pellet, Solar
23054	40061	Protective Outerwear	Commodity	Safety Shoes, Specialty Boots, Firefighting Turnout Gear, Firefighting Proximity Clothing, Wildland Clothing, Helmets, Gloves, Firefighting Boots, EMS/Search and Rescue Clothing, Bunker Gear
PGB-23197	10200	Distribution of Vaccines Including Influenza Vaccines (Statewide) (Replaces Award PGB-22797)	Commodity	General Vaccines, Seasonal Influenza Vaccines, Pharmaceuticals, Drugs, Flu Shot, COVID-19 Vaccine
PGB-23290	10200	MMCAP Infuse Influenza Vaccines (Statewide)	Commodity	General Vaccines, Seasonal Influenza Vaccines, Pharmaceuticals, Drugs, Flu Shot

Configured like this:

[Screenshot of the configuration: CleanShot 2024-03-07 at 10 57 00@2x]

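But it only extracted seven records, and the seventh looks garbled (award number 23254 paired with the title and keywords of award 23239):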

[
  {
    "award_number": 23295,
    "group": 20915,
    "award_title": "Furniture, All Types (Except Hospital Room and Patient Handling) (Statewide)",
    "type": "Commodity",
    "keywords": "Conference Furniture, Dormitory Furniture, General Purpose Tables, High Density Filing, Household Furniture, Library Furniture, Office Furniture, School Furniture, Specialty Seating, Tall Seating, Bariatric, Gang Seating, Laboratory Stools, Systems Furniture"
  },
  {
    "award_number": 23287,
    "group": 5500,
    "award_title": "Fuel Oil, Heating (Grades #2, #6 Kerosene and Bioheating Fuel) (Statewide)",
    "type": "Commodity",
    "keywords": "Fuel oil, Heating, Kerosene, Bioheating Fuel, Heating"
  },
  {
    "award_number": 23321,
    "group": 5900,
    "award_title": "Natural Gas (Firm Supply - Specific Locations Within National Grid Territories)",
    "type": "Commodity",
    "keywords": "Natural Gas"
  },
  {
    "award_number": 23283,
    "group": 5800,
    "award_title": "Liquefied Petroleum Gas (LPG) - Propane (Statewide)",
    "type": "Commodity",
    "keywords": "Cylinders, Gallons, Tanks, Installation, Testing, Inspections, LP, Liquid Propane, Butane, Isobutene"
  },
  {
    "award_number": 23315,
    "group": 1800,
    "award_title": "Road Salt, Treated Salt, & Emergency Standby Road Salt (Statewide)",
    "type": "Commodity",
    "keywords": "Ice, Snow, Sodium, Chloride"
  },
  {
    "award_number": 23272,
    "group": 50030,
    "award_title": "Wove & Kraft Envelopes",
    "type": "Commodity",
    "keywords": "Printed Envelopes, Non-Printed Envelopes"
  },
  {
    "award_number": 23254,
    "group...,": 23239,
    "award_title": "Milk, Fluid (Statewide)",
    "type": "Commodity",
    "keywords": "Low Fat Milk, Reduced Fat Milk, Skim Milk, Homogenized Milk"
  }
]

simonw commented Mar 7, 2024

First suspicion: there's some default number of tokens in the output that this is falling victim to.

So I added max_tokens=4096 since a few random searches seemed to hint that was the maximum.

And I got 9 instead of 7. When I pasted the output JSON into a token counter it was only 888 tokens, so nowhere near the limit.
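
To see exactly which rows were dropped rather than just counting them, the pasted table can be compared against the extracted JSON. This is a minimal sketch, not part of the tool: the function name `missing_award_numbers` is hypothetical, and it assumes the input is tab-separated with an `Award #` header and that the output objects use an `award_number` key, as in the example above. The sample data is shortened for illustration.

```python
import csv
import io
import json


def missing_award_numbers(pasted_tsv: str, extracted_json: str) -> list[str]:
    """Return award numbers that are in the pasted table but not in the output.

    Hypothetical helper: assumes an "Award #" column in the input and an
    "award_number" key in each extracted object.
    """
    reader = csv.DictReader(
        io.StringIO(pasted_tsv), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    expected = [row["Award #"].strip() for row in reader]
    # Compare as strings so 23295 (a JSON number) and "PGB-23243" are treated alike.
    found = {str(item.get("award_number")) for item in json.loads(extracted_json)}
    return [number for number in expected if number not in found]


# Shortened sample input: three rows, of which only one was extracted.
sample_tsv = (
    "Award #\tGroup\tAward Title\tType\tKeywords\n"
    "23295\t20915\tFurniture\tCommodity\tOffice Furniture\n"
    "23287\t05500\tFuel Oil\tCommodity\tKerosene\n"
    "PGB-23243\t35000\tVehicle Lifts\tCommodity\tGarage\n"
)
sample_json = '[{"award_number": 23295}]'

print(missing_award_numbers(sample_tsv, sample_json))
# → ['23287', 'PGB-23243']
```

A check like this only confirms that an award number appears somewhere in the output. It would not catch a record whose other fields were mixed up with another row.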

simonw added a commit that referenced this issue Mar 7, 2024

simonw commented Mar 7, 2024

This may need to be solved by documentation: a note on the page that warns you that it will not necessarily get everything.

This tool is going to need quite a bit of inline documentation to help people deal with its limitations.

simonw added the documentation label Mar 7, 2024
simonw added a commit that referenced this issue Mar 7, 2024
simonw added a commit that referenced this issue Mar 7, 2024
Refs #1, #2, #3, #4, #5, #6, #7, #12, #13