Bypass shorter predictions by vw #7

Open · wants to merge 1 commit into master

Conversation

ivan-pavlov

I think I found a bug in the external vw code.

Sometimes a shorter predictions file is generated when a neural network model is used (specified by the --nn argument) on macOS 10.13.1, and probably on other platforms.
This causes an error in the roc function from the pROC package and can be seen in the vw_example.R demo file.

While there is no fix for this behavior right now, it may be useful to add a bypass.

@eddelbuettel
Collaborator

That looks good to me, thanks for catching that! I may not use the --nn flag much...

The package does not even have a tests/ directory, but can you think of a way to demonstrate the 'before' and 'after' (i.e. how your PR helps catch this)? Is it worth a file in demo/?

@ivan-pavlov
Author

I don't think this is worth a separate demo file.
In my opinion, it can be included in the vw_example.R file.

library(ggplot2) 
library(rvw)
library(data.table)

Loading and preparing the data as in vw_example.R:

get_feature_type <- function(X, threshold = 50, verbose = FALSE) {
    q_levels <- function (x) {
        if (data.table::is.data.table(x)) {
            unlist(x[, lapply(.SD, function(x) length(unique(x)))])
        } else {
            apply(x, 2, function(x) length(unique(x)))
        }
    }

    lvs = q_levels(X)
    fact_vars = names(lvs[lvs < threshold])
    num_vars = names(lvs[lvs >= threshold])
    if (verbose) {
        print(data.frame(lvs))
    }
    list(fact_vars = fact_vars, num_vars = num_vars)
}

dt <- diamonds
dt <- data.table::setDT(dt)
target <- 'y'
data_types <- get_feature_type(dt[, setdiff(names(dt), target), with=FALSE], threshold = 50)
namespaces <- list(n = list(varName = data_types$num_vars, keepSpace=FALSE),
                   c = list(varName = data_types$fact_vars, keepSpace=FALSE))
dt$y <- with(dt, ifelse(y < 5.71, 1, -1))
dt2vw(dt, 'diamonds.vw', namespaces, target=target, weight=NULL)

allLines <- readLines("diamonds.vw")
N <- length(allLines)
writeLines(allLines[1:10000], "X_train.vw")
writeLines(allLines[10001:N], "X_valid.vw")
write.table(tail(dt$y,43940), file='valid_labels.txt',
            row.names = FALSE, col.names = FALSE, quote = FALSE)

training_data <- "X_train.vw"
validation_data <- "X_valid.vw"
validation_labels <- "valid_labels.txt"
out_probs <- "out.txt"
model <- "mdl.vw"

The code below represents:
Training with a single-hidden-layer feedforward neural network with 10 hidden neurons:
vw -d X_train.vw --loss_function logistic -f mdl.vw --nn 10 -b 25 --passes 1 -c
Testing:
vw -t -i mdl.vw -p out.txt -d X_valid.vw --link=logistic
This only computes predictions for ~43800 samples instead of 43940.
Also, some predictions are corrupted: they have the form "\n0.9364170.015465\n"
instead of the form "\n0.936417\n0.015465\n".
But even resolving this corruption will not give the needed number of samples.
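
A quick way to spot those corrupted rows (a minimal sketch, not part of the PR; it assumes every well-formed prediction line contains exactly one decimal point):

## count prediction lines that contain two or more decimal points,
## i.e. the concatenated "0.9364170.015465"-style rows
pred_lines <- readLines("out.txt")
bad <- grepl("\\..*\\.", pred_lines)
sum(bad)                # number of corrupted rows
head(pred_lines[bad])   # inspect a few of them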

For rvw this will give us:

Using perf

auc_perf <- vw(training_data = training_data, validation_data = validation_data, out_probs = "out.txt",
   validation_labels = validation_labels,
   use_perf=TRUE,
   extra = "--nn 10")

Throws the warning message "Predicted values file is longer." generated by the perf program.

Before the PR, using pROC

auc_proc <- vw(training_data = training_data, validation_data = validation_data, out_probs = "out.txt",
               validation_labels = validation_labels,
               use_perf=FALSE,
               extra = "--nn 10")

Throws an error: "The length of the probabilities and labels is different"

if (!identical(probs_len, labels_len)) {
    stop('The length of the probabilities and labels is different')
}

After the PR, using pROC

The probs and labels vectors will be coerced to numeric type.
This will introduce NAs in some places.
Then the longer vector will be trimmed to the length of the shorter one.
Then rows with NA will be omitted from both vectors.
The resulting vectors will have equal length and no NAs and will be used to compute the ROC curve (a toy sketch of this is included at the end of this comment).

auc_proc <- vw(training_data = training_data, validation_data = validation_data, out_probs = "out.txt",
               validation_labels = validation_labels,
               use_perf=FALSE,
               extra = "--nn 10")

Throws the warning message "The length of the probabilities and labels is different".
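
For illustration only, a toy sketch of that trimming and NA handling on made-up vectors (the actual vw.R change is shown further down in this thread):

probs  <- c(0.93, NA, 0.21, 0.55)   # NA stands in for a corrupted row
labels <- c(1, -1, 1, -1, 1)        # one element longer than probs
n <- min(length(probs), length(labels))
probs  <- probs[seq_len(n)]         # trim to the common length
labels <- labels[seq_len(n)]
keep   <- !is.na(probs) & !is.na(labels)
probs[keep]                         # equal length, no NAs
labels[keep]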

@eddelbuettel
Collaborator

Ok, it is a little hard to see what code you'd change in the example.

Also, when I currently run Rscript vw_example.R it does not end in an error...

@ivan-pavlov
Author

Without the PR I get this error message:

 Error in roc_auc(out_probs, validation_labels, plot_roc, cmd) : 
  The length of the probabilities and labels is different 
5.
stop("The length of the probabilities and labels is different") 
4.
roc_auc(out_probs, validation_labels, plot_roc, cmd) 
3.
vw(training_data = training_data, validation_data = validation_data, 
    validation_labels = validation_labels, model = model, loss = "logistic", 
    b = 25, learning_rate = g[["eta"]], passes = 2, l1 = g[["l1"]], 
    l2 = g[["l2"]], early_terminate = 2, extra = g[["extra"]],  ... 
2.
FUN(X[[i]], ...) 
1.
lapply(1:nrow(grid), function(i) {
    g <- grid[i, ]
    auc <- vw(training_data = training_data, validation_data = validation_data, 
        validation_labels = validation_labels, model = model,  ... 

This happens when executing the grid search loop:

aucs <- lapply(1:nrow(grid), function(i) {
    g <- grid[i, ]
    auc <- vw(training_data=training_data, # files relative paths
              validation_data=validation_data,
              validation_labels=validation_labels, model=model,
              ## grid options
              loss='logistic', b=25, learning_rate=g[['eta']],
              passes=2, l1=g[['l1']], l2=g[['l2']],
              early_terminate=2, extra=g[['extra']],
              ## ROC-AUC related options
              use_perf=FALSE, plot_roc=TRUE,
              do_evaluation = TRUE # If false doesn't compute AUC, use only for prediction
              )
    auc$auc
})

To bypass it, I changed this code in the roc_auc function in vw.R:

The probs and labels vectors are now coerced to numeric type.
This introduces NAs in some places because of the corruption in the predictions file.

probs <- as.numeric(fread(out_probs)[['V1']])
labels <- as.numeric(fread(validation_labels)[['V1']])

Before it was:

probs <- fread(out_probs)[['V1']]
labels <- fread(validation_labels)[['V1']]

Then the longer vector is trimmed to the length of the shorter one.

## Bypass shorter predictions by vw
probs_len <- length(probs)
labels_len <- length(labels)
if (!identical(probs_len, labels_len)) {
    # stop('The length of the probabilities and labels is different')
    warning('The length of the probabilities and labels is different')
    if (probs_len < labels_len) {
        labels <- labels[1:probs_len]
    } else {
        probs <- probs[1:labels_len]
    }
}

Before it was:

if (!identical(length(probs), length(labels)))
    stop('The length of the probabilities and labels is different')

The resulting vectors will have equal length and no NAs and will be used to compute the ROC curve:

roc <- roc(labels, probs, auc=TRUE, print.auc=TRUE, print.thres=TRUE, na.rm = TRUE)

Before it was:

roc <- roc(labels, probs, auc=TRUE, print.auc=TRUE, print.thres=TRUE)
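
If I read the pROC documentation correctly, na.rm = TRUE makes roc() drop observations whose predictor is NA, which is what handles the NAs introduced by the numeric coercion. A toy check with made-up data (not from the PR):

library(pROC)
roc(c(1, -1, 1, -1, 1), c(0.9, 0.2, NA, 0.4, 0.8), auc = TRUE, na.rm = TRUE)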

@ivan-pavlov
Author

I would add the code from my first comment to show how the behavior changes with and without perf when --nn N is used.

@eddelbuettel
Collaborator

  1. It is really hard to see what you are trying to say as we are repeating so much identical code. For this, PRs and diffs are really much better.

  2. I just re-ran demo/vw_example.R step-by-step. No error. Any idea? It includes the nn option as I get this:

R> results <- cbind(iter=1:nrow(grid), grid, auc=do.call(rbind, aucs))
R> print(results)
   iter    l1    l2  eta   extra      auc
1     1 1e-07 1e-07 0.10 --nn 10 0.996496
2     2 1e-08 1e-07 0.10 --nn 10 0.996496
3     3 1e-07 1e-08 0.10 --nn 10 0.996496
4     4 1e-08 1e-08 0.10 --nn 10 0.996496
5     5 1e-07 1e-07 0.05 --nn 10 0.995664
6     6 1e-08 1e-07 0.05 --nn 10 0.995664
7     7 1e-07 1e-08 0.05 --nn 10 0.995664
8     8 1e-08 1e-08 0.05 --nn 10 0.995664
9     9 1e-07 1e-07 0.10         0.987865
10   10 1e-08 1e-07 0.10         0.991949
11   11 1e-07 1e-08 0.10         0.987865
12   12 1e-08 1e-08 0.10         0.991949
13   13 1e-07 1e-07 0.05         0.988334
14   14 1e-08 1e-07 0.05         0.991517
15   15 1e-07 1e-08 0.05         0.988334
16   16 1e-08 1e-08 0.05         0.991517
R> 

@eddelbuettel
Collaborator

Also:

edd@rob:/tmp/RtmpWzLCcH$ wc valid_labels.txt out.txt                                                     
 43940  43940 106569 valid_labels.txt               
 43940  43940 395460 out.txt                        
 87880  87880 502029 total                          
edd@rob:/tmp/RtmpWzLCcH$  

So I am at a loss. Could you please prepare a self-contained script exhibiting the bug?

@ivan-pavlov
Author

If I understand correctly, I get this error because of the vw program itself.
I get a shorter predictions file preds.vw with some corrupted rows.
I've tried the latest Homebrew version of Vowpal Wabbit and also the one I built from the latest source files. Both of them exhibit this behavior, but you don't see errors, so I suppose it is a platform-dependent problem.

rvw with the perf program works with predictions and responses of different sizes without errors, but with the roc function from pROC it fails.
I've tried to find a workaround so that it will work with pROC.

I will try to see if I get the same bug on other platforms.

This R script fails for me right now:

library(ggplot2) 
library(rvw)
library(data.table)

get_feature_type <- function(X, threshold = 50, verbose = FALSE) {
  q_levels <- function (x) {
    if (data.table::is.data.table(x)) {
      unlist(x[, lapply(.SD, function(x) length(unique(x)))])
    } else {
      apply(x, 2, function(x) length(unique(x)))
    }
  }
  
  lvs = q_levels(X)
  fact_vars = names(lvs[lvs < threshold])
  num_vars = names(lvs[lvs >= threshold])
  if (verbose) {
    print(data.frame(lvs))
  }
  list(fact_vars = fact_vars, num_vars = num_vars)
}

dt <- diamonds
dt <- data.table::setDT(dt)
target <- 'y'
data_types <- get_feature_type(dt[, setdiff(names(dt), target), with=FALSE], threshold = 50)
namespaces <- list(n = list(varName = data_types$num_vars, keepSpace=FALSE),
                   c = list(varName = data_types$fact_vars, keepSpace=FALSE))
dt$y <- with(dt, ifelse(y < 5.71, 1, -1))
dt2vw(dt, 'diamonds.vw', namespaces, target=target, weight=NULL)


allLines <- readLines("diamonds.vw")
N <- length(allLines)
writeLines(allLines[1:10000], "X_train.vw")
writeLines(allLines[10001:N], "X_valid.vw")
write.table(tail(dt$y,43940), file='valid_labels.txt',
            row.names = FALSE, col.names = FALSE, quote = FALSE)

training_data <- "X_train.vw"
validation_data <- "X_valid.vw"
validation_labels <- "valid_labels.txt"
out_probs <- "out.txt"
model <- "mdl.vw"

auc_perf <- vw(training_data = training_data, validation_data = validation_data, out_probs = "out.txt",
               validation_labels = validation_labels,
               use_perf=FALSE,
               extra = "--nn 10")

with this error message:

...
finished run
number of examples per pass = 43940
passes used = 1
weighted example sum = 43940.000000
weighted label sum = 6562.000000
average loss = 29.772478
best constant = 0.149340
best constant's loss = 0.977698
total feature number = 439376
Error in roc_auc(out_probs, validation_labels, plot_roc, cmd) : 
  The length of the probabilities and labels is different
In addition: Warning message:
In fread(out_probs) :
  Bumped column 1 to type character on data row 135, field contains '0.0008510.000842'. Coercing previously read values in this column from logical, integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses a sample of 1,000 rows (100 rows at 10 points) so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.
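
Following the suggestion in that warning, the corrupted rows can also be counted by reading the column as character first (a minimal sketch, not part of the PR):

probs_raw <- data.table::fread("out.txt", header = FALSE,
                               colClasses = "character")[["V1"]]
probs <- suppressWarnings(as.numeric(probs_raw))   # corrupted rows coerce to NA
sum(is.na(probs))                                  # how many rows are corrupted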

My system:

platform       x86_64-apple-darwin15.6.0   
arch           x86_64                      
os             darwin15.6.0                
system         x86_64, darwin15.6.0        
status                                     
major          3                           
minor          4.3                         
year           2017                        
month          11                          
day            30                          
svn rev        73796                       
language       R                           
version.string R version 3.4.3 (2017-11-30)
nickname       Kite-Eating Tree   
‘data.table’ version 1.10.4-3
‘ggplot2’ version 2.2.1
‘pROC’ version 1.10.0
‘rvw’ version 0.1.1.1
vw version 8.5.0
System Version:	macOS 10.13.1 (17B1003)

@eddelbuettel
Collaborator

I am at vw version 8.1.1 on Ubuntu 17.10. This may be something that changed in vw.

Let me see if I can quickly build 8.5.0.

@eddelbuettel
Collaborator

Unchanged with vw 8.5.0:

[...]
finished run
number of examples per pass = 43940
passes used = 1
weighted example sum = 43940.000000
weighted label sum = 6562.000000
average loss = 29.772555
best constant = 0.149340
best constant's loss = 0.977698
total feature number = 439376

Call:
roc.default(response = labels, predictor = probs, auc = TRUE,     print.auc = TRUE, print.thres = TRUE)

Data: probs in 18689 controls (labels -1) < 25251 cases (labels 1).
Area under the curve: 0.996

Model Parameters
 /usr/bin/vw -d X_train.vw --loss_function logistic -f mdl.vw --learning_rate=0.5 --passes 1 -c -b 25 --nn 10 

AUC:  0.996387 
edd@rob:~/git/rvw/demo(master)$ 

ie I just don't see the error you are seeing.

@ivan-pavlov
Author

Sorry for the long wait; I've been testing different versions of Vowpal Wabbit with the --nn N option.

I used these arguments for training and predicting:

./vw -d X_train.vw --loss_function logistic -f mdl.vw --learning_rate=0.10 --passes 2 -c --l1 1e-07 --l2 1e-07 -b 25 --early_terminate 2 --nn 10 
./vw -t -i mdl.vw -p preds.vw --link=logistic -d X_valid.vw

Similar arguments are used in vw_example.R

For versions 7.10, 8.0, 8.1.1, 8.2.1, and 8.3.1 everything works correctly:

wc valid_labels.txt preds.vw 
   43940   43940  106569 valid_labels.txt
   43756   43756  394708 preds.vw
   87696   87696  501277 total

Unfortunately, I didn't manage to build version 8.4.0.

The latest version, 8.5.0, gives me the following incorrect result:

wc valid_labels.txt preds.vw 
   43940   43940  106569 valid_labels.txt
   43907   43907  395347 preds.vw
   87847   87847  501916 total

I suppose this only happens on macOS systems, but I will test this behavior on other operating systems.

@eddelbuettel
Collaborator

Thanks for your patience with that; rebuilding those versions is work! I had just jumped to the Debian package sources of 8.5.0 from here and then built a local package. It looks like that was a full 8.5.0 release, so the difference to yours may indeed be macOS vs Linux. Strange.

Did you peek into the vw mailing list? And/or would you have a chance to test on another OS?

@ivan-pavlov
Author

I have not yet contacted people on the vw mailing list, but I am planning to do it today.
I have Win10 and Gentoo at my disposal and will try them.

@eddelbuettel
Collaborator

I was even thinking just about lurking on the list / looking where the issue had come up.

A cross-check on Windows or Linux should be useful.
