Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cast fields to numeric or other types in a fixed type data loader file using pgloader #245

Closed
vijmen opened this issue Jun 9, 2015 · 8 comments

Comments

@vijmen
Copy link

vijmen commented Jun 9, 2015

This is not an issue but a question. Not sure whether this is the right place, so, please let me know if I should post it elsewhere.

Is it possible to cast fields to numeric or other types in a fixed type data loader file using pgloader?

I have a working .load file for pgloader which works if all the data types are text. However, I haven't been able to use pgloader to cast the field values to any other data type. Is it possible to do this either using a .load file or the pgloader command line? Working script with values removed below:

LOAD FIXED
FROM
http://testfile (on S3)
WITH ENCODING UTF8
(
        f1 from 0 for 4,
        f2 from 4 for 18
)
INTO
        postgresql://<user>:<pwd>@<db-details>?<table>
        (
                f1,
                f2
        )
WITH TRUNCATE
AFTER LOAD DO
$$
        SELECT * from post_processor();
$$
;
@dimitri
Copy link
Owner

dimitri commented Jun 9, 2015

It is possible, either by sending directly text representation of numeric types, or by using transformation functions on the source data when necessary. Please consider posting error messages of what you get when you try so that I can better help you here.

@vijmen
Copy link
Author

vijmen commented Jun 10, 2015

Thanks for your response. Yes, pgloader loaded in text representations into numeric columns successfully.

Was also wondering about the best way to convert a date format like "08-04-201500:00" before loading.

@vijmen
Copy link
Author

vijmen commented Jun 10, 2015

To add to my date casting question (08-04-201500:00 to timestamptz), I tried the following.

LOAD FIXED
FROM
http://testfile (on S3)
WITH ENCODING UTF8
(
        f1 from 0 for 4,
        f2 from 4 for 18
)
INTO
        postgresql://<user>:<pwd>@<db-details>?<table>
        (
                f1,
                f2
        )
CAST dbtable.f2 to timestamptz USING date-with-no-separator
WITH TRUNCATE
AFTER LOAD DO
$$
        SELECT * from post_processor();
$$
;

This threw the error (Please ignore any line numbers as I am using a working load script:

LOAD FIXED
  ^ (Line 1, Column 0, Position 0)

Could not parse subexpression ";" when parsing

 Expression PGLOADER.PARSER::COMMANDS
    Subexpression (+ PGLOADER.PARSER::COMMAND)
    Subexpression PGLOADER.PARSER::COMMAND
    Subexpression (AND
                   (OR PGLOADER.PARSER::LOAD-ARCHIVE
                       PGLOADER.PARSER::LOAD-CSV-FILE
                       PGLOADER.PARSER::LOAD-FIXED-COLS-FILE
                       PGLOADER.PARSER::LOAD-COPY-FILE
                       PGLOADER.PARSER::LOAD-DBF-FILE
                       PGLOADER.PARSER::LOAD-IXF-FILE
                       PGLOADER.PARSER::LOAD-MYSQL-DATABASE
                       PGLOADER.PARSER::LOAD-MSSQL-DATABASE
                       PGLOADER.PARSER::LOAD-SQLITE-DATABASE)
                   PGLOADER.PARSER::END-OF-COMMAND)
    Subexpression PGLOADER.PARSER::END-OF-COMMAND
    Subexpression (AND PGLOADER.PARSER::IGNORE-WHITESPACE #\;
                       PGLOADER.PARSER::IGNORE-WHITESPACE)
    Subexpression ";"

; 
; compilation unit aborted
;   caught 1 fatal ERROR condition

How would I pass in the required date format?

@dimitri
Copy link
Owner

dimitri commented Jun 10, 2015

See https://github.com/dimitri/pgloader/blob/master/test/fixed.load for an example of that. Basically there's no CAST support when loading from a file, because there's no type metadata in the source.

LOAD FIXED FROM http://testfile (on S3) WITH ENCODING UTF8
(
        f1 from 0 for 4,
        f2 from 4 for 18
)
INTO postgresql://<user>:<pwd>@<db-details>?<table>
(
       f1,
       f2 timestamptz using (date-with-no-separator f2)
)
WITH TRUNCATE
AFTER LOAD DO
$$
        SELECT * from post_processor();
$$
;

@vijmen
Copy link
Author

vijmen commented Jun 11, 2015

Hi Dimitri

Thanks for responding again.

I tried using the following format:

f2 using (date-with-no-separator f2 (:year 6 10 :month 3 5 :day 0 2 :hour 10 12 :minute 13 15 :seconds 13 15))

Input date: 08-04-201500:00

Errors below (Again, please ignore the line numbers):

; in: LAMBDA (PGLOADER.SOURCES::ROW)
;     (PGLOADER.TRANSFORMS::DATE-WITH-NO-SEPARATOR PGLOADER.TRANSFORMS::START_DATE
;                                                  (PGLOADER.TRANSFORMS::|:YEAR| 6
;                                                   10
;                                                   PGLOADER.TRANSFORMS::|:MONTH| 3
;                                                   5 PGLOADER.TRANSFORMS::|:DAY| 0
;                                                   2 PGLOADER.TRANSFORMS::|:HOUR|
;                                                   10 12 ...))
; --> BLOCK COND IF COND IF COND IF COND THE PROGN DESTRUCTURING-BIND LET LET* 
; --> LET* IF OR LET STRING= 
; ==>
;   (SB-KERNEL:STRING=* SB-C::STRING1 SB-C::STRING2 SB-C::START1 SB-C::END1
;                       SB-C::START2 SB-C::END2)
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a (OR (VECTOR CHARACTER) (VECTOR NIL) BASE-STRING
;                               SYMBOL CHARACTER), not a STRING.

; --> BLOCK COND IF COND IF COND IF COND THE PROGN DESTRUCTURING-BIND LET LET* 
; --> LET* IF OR LET IF OR LET STRING= 
; ==>
;   (SB-KERNEL:STRING=* SB-C::STRING1 SB-C::STRING2 SB-C::START1 SB-C::END1
;                       SB-C::START2 SB-C::END2)
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a (OR (VECTOR CHARACTER) (VECTOR NIL) BASE-STRING
;                               SYMBOL CHARACTER), not a STRING.

; --> BLOCK COND IF COND IF COND IF COND THE PROGN DESTRUCTURING-BIND LET LET* 
; --> LET* IF OR LET IF OR LET IF OR THE STRING= 
; ==>
;   (SB-KERNEL:STRING=* SB-C::STRING1 SB-C::STRING2 SB-C::START1 SB-C::END1
;                       SB-C::START2 SB-C::END2)
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a (OR (VECTOR CHARACTER) (VECTOR NIL) BASE-STRING
;                               SYMBOL CHARACTER), not a STRING.
; in: LAMBDA (PGLOADER.SOURCES::ROW)
;     (PGLOADER.TRANSFORMS::|:YEAR| 6 10 PGLOADER.TRANSFORMS::|:MONTH| 3 5
;      PGLOADER.TRANSFORMS::|:DAY| 0 2 PGLOADER.TRANSFORMS::|:HOUR| 10 12 ...)
; 
; caught WARNING:
;   undefined variable: PGLOADER.TRANSFORMS::|:DAY|
; 
; caught WARNING:
;   undefined variable: PGLOADER.TRANSFORMS::|:HOUR|
; 
; caught WARNING:
;   undefined variable: PGLOADER.TRANSFORMS::|:MINUTE|
; 
; caught WARNING:
;   undefined variable: PGLOADER.TRANSFORMS::|:MONTH|
; 
; caught WARNING:
;   undefined variable: PGLOADER.TRANSFORMS::|:SECONDS|
; 
; caught STYLE-WARNING:
;   undefined function: PGLOADER.TRANSFORMS::|:YEAR|
; 
; compilation unit finished
;   Undefined function:
;     PGLOADER.TRANSFORMS::|:YEAR|
;   Undefined variables:
;     PGLOADER.TRANSFORMS::|:DAY| PGLOADER.TRANSFORMS::|:HOUR| PGLOADER.TRANSFORMS::|:MINUTE| PGLOADER.TRANSFORMS::|:MONTH| PGLOADER.TRANSFORMS::|:SECONDS|
;   caught 5 WARNING conditions
;   caught 1 STYLE-WARNING condition
;   printed 3 notes
.
.
.
2015-06-11T03:51:32.461000+03:00 ERROR Could not read line 1: The variable PGLOADER.TRANSFORMS::|:MONTH| is unbound.

What would you suggest?

Thanks
VM

@dimitri
Copy link
Owner

dimitri commented Jun 11, 2015

Ok, several problems here, several of those I should fix in pgloader itself. First, you need to quote the parameter list, using e.g. (quote (:year ...)). Second, the list should be written (quote ((:year 6 10) (:month ...) ...)). Third, the date-with-no-separator checks the length of the date string and returns NULL when not 14 chars long, so it will fail to import your data.

The current best option is then to write your own date parsing facility and load it from the command line, as in the following example:

./build/bin/pgloader --load-lisp-file ~/dev/temp/245.lisp ~/dev/temp/245.load
       table name       read   imported     errors            time
-----------------  ---------  ---------  ---------  --------------
            fetch          0          0          0          0.012s
      before load          2          2          0          0.018s
-----------------  ---------  ---------  ---------  --------------
             t242          4          4          0          0.022s
-----------------  ---------  ---------  ---------  --------------
Total import time          4          4          0          0.052s
2015-06-11T12:52:14.027000+02:00 LOG Main logs in '/private/tmp/pgloader/pgloader.log'
2015-06-11T12:52:14.031000+02:00 LOG Data errors in '/private/tmp/pgloader/'
2015-06-11T12:52:14.031000+02:00 LOG Parsing commands from file #P"/Users/dim/dev/temp/245.load"

And the files I've been using, first 245.lisp which just a copy paste from the function you wanted to use, with the length check removed and the comments not updated, sadly:

(in-package #:pgloader.transforms)

(defun parse-date-format
    (date-string
     &optional (format '((:year     6 10)
                         (:month    3  5)
                         (:day      0  2)
                         (:hour    10 12)
                         (:minute  13 15)
                         (:seconds 13 15))))
  "Apply this function when input date in like '20041002152952'"
  ;; only process non-zero dates
  (declare (type (or null string) date-string))
  (cond ((null date-string)                nil)
        ((string= date-string "")          nil)
        (t
         (destructuring-bind (&key year month day hour minute seconds
                                   &allow-other-keys)
             (loop
                for (name start end) in format
                append (list name (subseq date-string start end)))
           (if (or (string= year  "0000")
                   (string= month "00")
                   (string= day   "00"))
               nil
               (format nil "~a-~a-~a ~a:~a:~a"
                       year month day hour minute seconds))))))

And the load file, 245.load:

LOAD FIXED FROM INLINE
(
        f1 from 0 for 4,
        f2 from 4 for 18
)
INTO pgsql:///pgloader?t242
(
       f1,
       f2 timestamptz using (parse-date-format f2)
)
WITH TRUNCATE
BEFORE LOAD DO
  $$ drop table if exists t242; $$,
  $$ create table t242(f1 integer, f2 timestamptz); $$;


000108-04-201500:00
000209-04-201500:00
000310-04-201500:00
000411-04-201500:00

@vijmen
Copy link
Author

vijmen commented Jun 12, 2015

Hi Dimitri
That worked perfectly.
Thanks a lot for this; you've been really helpful :)

@dimitri
Copy link
Owner

dimitri commented Jun 12, 2015

Cool, closing the issue then! Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants