
TabPFNRegressor preprocessing fails on bigger datasets #169

Open

LeoGrin opened this issue Feb 4, 2025 · 1 comment

LeoGrin (Collaborator) commented Feb 4, 2025

See https://huggingface.co/Prior-Labs/TabPFN-v2-reg/discussions/2
It seems that QuantileTransformer fails on large datasets with the message "The number of quantiles cannot be greater than the number of samples used", which makes TabPFN unusable for these larger datasets even with ignore_pretraining_constraints=True. It seems to happen only on regression (not sure).
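A minimal repro sketch of the failure mode described above. The dataset size and the `num_examples // 10` quantile formula mirror the report; `subsample=10_000` is passed explicitly here (it is the default in recent scikit-learn, but passing it makes the repro version-independent):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# A "bigger" dataset: 200K samples.
X = np.random.rand(200_000, 2)

# TabPFN-style choice: n_quantiles = num_examples // 10 = 20_000,
# which exceeds the 10_000 subsample limit.
qt = QuantileTransformer(n_quantiles=len(X) // 10, subsample=10_000)
try:
    qt.fit(X)
except ValueError as e:
    # "The number of quantiles cannot be greater than the number of samples used. ..."
    print(e)
```

The check fires at the start of `fit`, before any quantiles are computed, so the error appears even though the full dataset has far more samples than quantiles.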

LeoGrin (Author) commented Feb 4, 2025

In the preprocessing QuantileTransformers, we set the number of quantiles to num_examples // 10 or num_examples // 5, which guarantees it is lower than the number of samples; but the subsample parameter is left at its default of 10K, which can be lower than the number of quantiles when the number of samples is large. We can either:

  • limit the number of quantiles to 10K, or
  • set the subsample really high.

I quickly checked the time cost of the second option with this test:

```python
import numpy as np
import time
from sklearn.preprocessing import QuantileTransformer

def test_quantile_transformer_speed():
    # Use a dataset with many samples so that default subsampling is active.
    n_samples = 200_000  # more than the default subsample limit of 10,000
    n_features = 100
    n_quantiles = 10_000
    X = np.random.rand(n_samples, n_features)

    n_runs = 5
    default_times = []
    large_times = []

    for run in range(n_runs):
        print(f"\nRun {run + 1}/{n_runs}")

        # Test with default settings
        print("Testing QuantileTransformer with default subsample parameter")
        qt_default = QuantileTransformer(random_state=42, n_quantiles=n_quantiles)
        t0 = time.perf_counter()
        X_trans_default = qt_default.fit_transform(X)
        X_trans_default_2 = qt_default.transform(X)
        t1 = time.perf_counter()
        default_time = t1 - t0
        default_times.append(default_time)
        print(f"Default QuantileTransformer fit_transform time: {default_time:.6f} sec")
        print("Transformed shape:", X_trans_default.shape)

        # Test with subsample explicitly set
        print("\nTesting QuantileTransformer with subsample=100_000")
        qt_large = QuantileTransformer(subsample=100_000, random_state=42, n_quantiles=n_quantiles)
        t0 = time.perf_counter()
        X_trans_large = qt_large.fit_transform(X)
        X_trans_large_2 = qt_large.transform(X)
        t1 = time.perf_counter()
        large_time = t1 - t0
        large_times.append(large_time)
        print(f"QuantileTransformer (subsample=100_000) fit_transform time: {large_time:.6f} sec")
        print("Transformed shape:", X_trans_large.shape)

    # Print summary statistics
    print("\nSummary Statistics:")
    print("Default QuantileTransformer:")
    print(f"  Average time: {np.mean(default_times):.6f} sec")
    print(f"  Std dev: {np.std(default_times):.6f} sec")
    print(f"  Times: {[f'{t:.6f}' for t in default_times]}")

    print("\nQuantileTransformer (subsample=100_000):")
    print(f"  Average time: {np.mean(large_times):.6f} sec")
    print(f"  Std dev: {np.std(large_times):.6f} sec")
    print(f"  Times: {[f'{t:.6f}' for t in large_times]}")

if __name__ == '__main__':
    test_quantile_transformer_speed()
```

And got:

```
Summary Statistics:
Default QuantileTransformer:
  Average time: 7.082734 sec
  Std dev: 0.044457 sec
  Times: ['7.070789', '7.033962', '7.103202', '7.047488', '7.158230']

QuantileTransformer (subsample=100_000):
  Average time: 5.735545 sec
  Std dev: 0.040030 sec
  Times: ['5.678141', '5.717424', '5.721183', '5.772044', '5.788931']
```

So, surprisingly, increasing the subsample seems to be a bit faster 🤔

(For 1K quantiles I get

```
Summary Statistics:
Default QuantileTransformer:
  Average time: 4.122261 sec
  Std dev: 0.044430 sec
  Times: ['4.209649', '4.111263', '4.086600', '4.104299', '4.099495']

QuantileTransformer (subsample=100_000):
  Average time: 4.494478 sec
  Std dev: 0.045392 sec
  Times: ['4.579021', '4.465794', '4.501846', '4.474663', '4.451065']
```

so for fewer quantiles the default is slightly faster.)

@noahho would you have an opinion on which of the two options to pick, and on whether changing this parameter after training might be an issue?
