
TabPFNRegressor preprocessing fails on bigger datasets #169

Open

LeoGrin opened this issue Feb 4, 2025 · 1 comment

LeoGrin (Collaborator) commented Feb 4, 2025

See https://huggingface.co/Prior-Labs/TabPFN-v2-reg/discussions/2
It seems that QuantileTransformer fails on large datasets with the message "The number of quantiles cannot be greater than the number of samples used", which makes TabPFN unusable for these larger datasets even with ignore_pretraining_constraints=True. It seems to happen only on regression (not sure).
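A minimal repro sketch of the failure mode described above. The dataset size and the `num_examples // 10` quantile formula mirror the report; `subsample=10_000` is passed explicitly here (it is the default in recent scikit-learn, but passing it makes the repro version-independent):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# A "bigger" dataset: 200K samples.
X = np.random.rand(200_000, 2)

# TabPFN-style choice: n_quantiles = num_examples // 10 = 20_000,
# which exceeds the 10_000 subsample limit.
qt = QuantileTransformer(n_quantiles=len(X) // 10, subsample=10_000)
try:
    qt.fit(X)
except ValueError as e:
    # "The number of quantiles cannot be greater than the number of samples used. ..."
    print(e)
```

The check fires at the start of `fit`, before any quantiles are computed, so the error appears even though the full dataset has far more samples than quantiles.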

LeoGrin (Author) commented Feb 4, 2025

In the preprocessing QuantileTransformers, we set the number of quantiles to num_examples // 10 or num_examples // 5, which guarantees it is lower than the number of samples; but the subsample parameter is left at its default of 10K, which can be lower than the number of quantiles when the number of samples is large. We can either:

  • limit the number of quantiles to 10K, or
  • set the subsample really high.

I quickly checked the time cost of the second option with this test:

```python
import numpy as np
import time
from sklearn.preprocessing import QuantileTransformer

def test_quantile_transformer_speed():
    # Use a dataset with many samples so that default subsampling is active.
    n_samples = 200_000  # more than the default subsample limit of 10,000
    n_features = 100
    n_quantiles = 10_000
    X = np.random.rand(n_samples, n_features)

    n_runs = 5
    default_times = []
    large_times = []

    for run in range(n_runs):
        print(f"\nRun {run + 1}/{n_runs}")

        # Test with default settings
        print("Testing QuantileTransformer with default subsample parameter")
        qt_default = QuantileTransformer(random_state=42, n_quantiles=n_quantiles)
        t0 = time.perf_counter()
        X_trans_default = qt_default.fit_transform(X)
        X_trans_default_2 = qt_default.transform(X)
        t1 = time.perf_counter()
        default_time = t1 - t0
        default_times.append(default_time)
        print(f"Default QuantileTransformer fit_transform time: {default_time:.6f} sec")
        print("Transformed shape:", X_trans_default.shape)

        # Test with subsample explicitly set
        print("\nTesting QuantileTransformer with subsample=100_000")
        qt_large = QuantileTransformer(subsample=100_000, random_state=42, n_quantiles=n_quantiles)
        t0 = time.perf_counter()
        X_trans_large = qt_large.fit_transform(X)
        X_trans_large_2 = qt_large.transform(X)
        t1 = time.perf_counter()
        large_time = t1 - t0
        large_times.append(large_time)
        print(f"QuantileTransformer (subsample=100_000) fit_transform time: {large_time:.6f} sec")
        print("Transformed shape:", X_trans_large.shape)

    # Print summary statistics
    print("\nSummary Statistics:")
    print("Default QuantileTransformer:")
    print(f"  Average time: {np.mean(default_times):.6f} sec")
    print(f"  Std dev: {np.std(default_times):.6f} sec")
    print(f"  Times: {[f'{t:.6f}' for t in default_times]}")

    print("\nQuantileTransformer (subsample=100_000):")
    print(f"  Average time: {np.mean(large_times):.6f} sec")
    print(f"  Std dev: {np.std(large_times):.6f} sec")
    print(f"  Times: {[f'{t:.6f}' for t in large_times]}")

if __name__ == '__main__':
    test_quantile_transformer_speed()
```

And got:

```
Summary Statistics:
Default QuantileTransformer:
  Average time: 7.082734 sec
  Std dev: 0.044457 sec
  Times: ['7.070789', '7.033962', '7.103202', '7.047488', '7.158230']

QuantileTransformer (subsample=100_000):
  Average time: 5.735545 sec
  Std dev: 0.040030 sec
  Times: ['5.678141', '5.717424', '5.721183', '5.772044', '5.788931']
```

So, surprisingly, increasing the subsample seems to be a bit faster 🤔

(For 1K quantiles I get

```
Summary Statistics:
Default QuantileTransformer:
  Average time: 4.122261 sec
  Std dev: 0.044430 sec
  Times: ['4.209649', '4.111263', '4.086600', '4.104299', '4.099495']

QuantileTransformer (subsample=100_000):
  Average time: 4.494478 sec
  Std dev: 0.045392 sec
  Times: ['4.579021', '4.465794', '4.501846', '4.474663', '4.451065']
```

so for fewer quantiles the default is slightly faster.)

@noahho would you have an opinion on which of the two options to pick, and on whether changing this parameter after training might be an issue?
