Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vsscanf() based date string parser #334

Merged
merged 34 commits into from
Apr 21, 2018
Merged

Conversation

sobri909
Copy link
Collaborator

I've temporarily left in the useScanfStrategy debug bool, to make it easy for you to do your own comparison testing, if you want.

@groue
Copy link
Owner

groue commented Apr 16, 2018

Thanks a lot, @sobri909. I'll answer as shortly as I can!

@sobri909
Copy link
Collaborator Author

Thanks @groue 😄

Another couple of implementation notes:

  • The code in the original blog post reused a single DateComponents instance across all requests, presumably to avoid any instantiation costs there. I didn't try using that optimisation here. Improved timing performance wasn't the priority, and it felt like something that could easily introduce bugs during later edits.

  • SQLiteDateParser could've been generalised to return DateComponents instead of DatabaseDateComponents. But ... meh. Can always do that later, if it ever needs to happen.

@sobri909
Copy link
Collaborator Author

I've cleaned up all the obvious test fails. But travis is failing out on what looks like unrelated stuff now (testCopyTransfersChanges, etc). I'll have a poke around in those tests tomorrow morning.

@groue
Copy link
Owner

groue commented Apr 16, 2018

Hello @sobri909,

The code in the original blog post reused a single DateComponents instance across all requests, presumably to avoid any instantiation costs there. I didn't try using that optimisation here. Improved timing performance wasn't the priority, and it felt like something that could easily introduce bugs during later edits.

💯 I fully agree.

We can avoid the static methods, shared allocated pointers, and the mutex with withUnsafeMutablePointer:

struct SQLDateComponentsParser {
    struct SQLDateComponents {
        var year: Int32 = 0
        var month: Int32 = 0
        var day: Int32 = 0
    }
    
    func parse(string: String) -> SQLDateComponents {
        var components = SQLDateComponents()
        let parseCount = withUnsafeMutablePointer(to: &components.year) { yearP in
            withUnsafeMutablePointer(to: &components.month) { monthP in
                withUnsafeMutablePointer(to: &components.day) { dayP in
                    withVaList([yearP, monthP, dayP]) { vaList in
                        vsscanf(string, "%d-%d-%d", vaList)
                    }
                }
            }
        }
        return components
    }
}

let sqlComponents = SQLDateComponentsParser().parse(string: "2018-04-16")
sqlComponents.year  // 2018
sqlComponents.month // 4
sqlComponents.day   // 16

SQLiteDateParser could've been generalised to return DateComponents instead of DatabaseDateComponents. But ... meh. Can always do that later, if it ever needs to happen.

This is interesting, especially considering that GRDB3 is on the way: we can break things, fix design issues, and why not question DatabaseDateComponents.

We'll see that it's not that simple (but I'm still quite open to suggestions):

DatabaseDateComponents, today, groups a plain DateComponents, and a format (YMD, etc), so that one knows the format of the parsed date string.

It is important to distinguish those formats. A missing component is not equivalent to a component that is explicitly set to 0: "2018-04-16" != "2018-04-16 00:00:00".

This format is exposed to the end user, for whatever useful purpose. It is also used internally, when GRDB decodes Date from String, via DatabaseDateComponent. For example, "12:00:00" can be converted to DatabaseDateComponent, with format .HMS, but it can't be converted to Date because it lacks YMD information. See Date.init?(databaseDateComponents:).

Now, there is information redundancy here, since components in DateComponents are optional. For example, to know if an SQLite string contains time information, one could just check the hour component:

// Why not?
let dateComponents: DateComponents = row["date"]
if dateComponents.hour == nil {
    // parsed "YYYY-MM-DD"
} else {
    // parsed... something... else (???)
}

But I think that this is easier to do with a plain and boring switch on a format enum, as today:

// Current API
let dbComponents: DatabaseDateComponents = row["date"]
switch dbComponents.format {
case .YMD:
    // parsed "YYYY-MM-DD"
case .YMD_HM:
    // parsed "YYYY-MM-DD HH:MM"
...
}

Finally, removing DatabaseDateComponents would expose us to a writing problem: not all DateComponents can be converted to valid SQLite date strings. When you store a DateComponents will nil components except year, what are you supposed to store in the database?

var components = DateComponents()
components.year = 2018

// What here?
// - "2018" is not a valid SQL date
// - "2018-01-01" is valid, but...
// - ...why not "2018-01-01 00:00:00.000" then?
// - fatal error, in order to complain about ambiguous input?
let string = String.fetchOne(db, "SELECT ?", arguments: [components])

DatabaseDateComponents does not suffer from this ambiguity. The user never doubts which format is used. All nil components in the requested format are given default values when needed:

var components = DateComponents()
components.year = 2018

// "2018-01-01 00:00"
let dbComponents = DatabaseDateComponents(components, format: .YMD_HM)
let string = String.fetchOne(db, "SELECT ?", arguments: [dbComponents])

I've cleaned up all the obvious test fails. But travis is failing out on what looks like unrelated stuff now (testCopyTransfersChanges, etc). I'll have a poke around in those tests tomorrow morning.

Those tests check that some record information is preserved by some transformations (copy, roundtrip through the database, etc). And it happens that the involved records have a date property :-) Those tests are not flaky, and have not been modified in a while: it's very likely that they spot an actual regression.

@groue
Copy link
Owner

groue commented Apr 16, 2018

it's very likely that [failing tests] spot an actual regression.

I'd look at testRecordIsNotEditedAfterFullFetch first: it should be the easier to track in the debugger. The Person.init(row:) initializer reads a Date that does not convert to the same database string as the database string it was parsed from. Hence a wrong output from hasDatabaseChanges.

@@ -179,7 +179,7 @@ class FoundationDateTests : GRDBTestCase {
XCTAssertEqual(calendar.component(.hour, from: date), 1)
XCTAssertEqual(calendar.component(.minute, from: date), 2)
XCTAssertEqual(calendar.component(.second, from: date), 3)
XCTAssertTrue(abs(calendar.component(.nanosecond, from: date) - 4_000_000) < 10) // We actually get 4_000_008. Some precision is lost during the DateComponents -> Date conversion. Not a big deal.
XCTAssertTrue(abs(calendar.component(.nanosecond, from: date) - 4_560_000) < 10)
Copy link
Owner

@groue groue Apr 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commit needs to be reverted.

Those tests make sure that GRDB handles fractional seconds in a compatible way with SQLite. Extra precision has to be ignored:

let seconds = try Double.fetchOne(db, "SELECT strftime('%f','2015-07-22 01:02:03.00456')")!
print(seconds) // 3.005, not 3.00456

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@groue I'm already on it 😉 Another commit coming in a sec.

Although SQLite is maintaining the precision for me locally. But we can't depend on that.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although SQLite is maintaining the precision for me locally. But we can't depend on that.

Interesting. What do you mean?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQLiteDateParser is preserving precision to 9 sig figs. So that test is passing for me locally. (And on travis too, so far. Yay!)

Although sim tests != real device. And having the tests demand that level of precision would be brittle for any changes in SQLite engine. SQLite I don't think is making any promises on how much precision it maintains.

So I'm about to commit a change to the tests that permits precision loss down to 3 sig figs. (The old tests were demanding that precision loss, instead of just permitting it).

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And having the tests demand that level of precision would be brittle for any changes in SQLite engine. SQLite I don't think is making any promises on how much precision it maintains.

Beware this reasoning: SQLite values backward compatibility just like Linux does (although D. Richard Hipp does not yell like Linus Torvalds). When in doubt, assume SQLite stability.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the only way to pass the test would've been to discard the precision?

Yes, yes: we'll have to discard precision as long as SQLite does.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except that SQLite is only discarding precision when manipulating dates, not when storing them. (Or at least, it's not discarding the precision when they're stored as strings).

So there's no reason to discard the precision when doing storage and retrieval? There's only a need to be aware of potential precision loss, when manipulating the dates SQL-side.

Copy link
Owner

@groue groue Apr 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(our messages have overlapped, and it may look like I dismiss your answers - I don't)

Their docs say they accept as many sig figs as given, and just don't promise to make use of any beyond three, when using strftime() etc.

OK, I'll look at that.

(But this does not mean that I'll let GRDB drift from actual, observed SQLite behavior)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hah. No worries 😄

this does not mean that I'll let GRDB drift from actual, observed SQLite behavior

Likewise! The docs do say that it's fine to store more precision, and that the date/time functions will simply ignore it. But it will be worth testing to make sure that's definitely true.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs do say that it's fine to store more precision, and that the date/time functions will simply ignore it. But it will be worth testing to make sure that's definitely true.

Yes, I'm with you on that. As I said, "an ideal test would assert identical parsing between GRDB and SQLite" I'll think about how to write the tests in this way.

Now I have to go fill my purse 😉 See you later!

@groue
Copy link
Owner

groue commented Apr 17, 2018

@sobri909, I owe you an apology, because I didn't guide you well.

Our above conversation has me think, and I'm not sure yet how I will conclude. I have been fuzzy, and am actually unable right now to precisely define what I mean by "compatibility with SQLite".

Besides, I feel uncomfortable that GRDB currently does not allow one to get the extra digits via date components. This does not fit well with the "experts welcome" design goal of this lib.

So maybe I'll rally your view, eventually.

So:

  1. I suggest that this PR does what it claims, and no more: just replace Scanner with a more efficient date parsing technique, with total compatibility with previous versions of GRDB (hence truncate behind the three first digits). Not a single test is changed, despite the flaws you have noticed.

  2. We both sleep on our discussion, and eventually come up with a desired enhancement to date/date components parsing.

What do you think?

@sobri909
Copy link
Collaborator Author

@groue Sounds fine to me!

I'll change the string length truncations in the last commit to truncate down to max of 3 sig figs. Then later if you want to allow preservation of more sig figs, it's an easy two line change.

@sobri909
Copy link
Collaborator Author

Hm. Travis is always timing out in the "TID=GRDBiOS (Xcode 9.2, iOS <MIN>)" tests set. It's stalling seemingly after doing EnumeratedCursorTests, so possibly during FTS3PatternTests.

Seems unrelated, and possibly just travis being moody, as it often is. But seems suspicious that it's always in the same place, even though that place doesn't appear to be related to the changes at all.

Not sure what to make of that...

@sobri909
Copy link
Collaborator Author

Ok, now that the tests are passing, I can finally read and reply to your refactor thoughts!

Aside: Except for those iOS <MIN> iPhone 4s tests timing out, which probably just mean travis is "holding it wrong" 😏

For withUnsafeMutablePointer(), I've done some more reading, and I think I understand what it's for now.

It seems like it's for when you want to use Swift vars for C / Objective-C functions that expect pointers? So you can use withUnsafeMutablePointer() without having to worry about manually managing alloc / dealloc of the vars?

But it has an awkward API that would require 7 levels of nesting, because you can only hand in one var per call 😞 And it seems like its automatic memory management wouldn't be useful in this case, because we currently don't need to ever dealloc the vars. So there's nothing to be gained?

avoid the static methods, shared allocated pointers, and the mutex

Haha. I consider those the main feature of the parser 😄 Because it only ever allocates its working memory once, it doesn't need to alloc/dealloc anything on each run, and isn't at risk of leaving any allocations behind like Scanner does.

For DatabaseDateComponents vs DateComponents, I agree with everything there! The extra decorations from DatabaseDateComponents all make sense, and make life easier.

My only thought was to change SQLiteDateParser so that it didn't know anything about DatabaseDateComponents, and just returned DateComponents instead. Then whichever class consumes that result could inspect the components and decide which of the supported fixed formats could be used in the final DatabaseDateComponents.

But that's just rearranging the furniture without any real gain. Until there's something useful to be gained from a reshuffle, it doesn't seem worth the time?

@groue
Copy link
Owner

groue commented Apr 17, 2018

Ok, now that the tests are passing, I can finally read and reply to your refactor thoughts!

Aside: Except for those iOS <MIN> iPhone 4s tests timing out, which probably just mean travis is "holding it wrong" 😏

The job ends with "No output has been received in the last 10m0s"... I agree: don't bother about it.

For withUnsafeMutablePointer(), I've done some more reading, and I think I understand what it's for now.

It seems like it's for when you want to use Swift vars for C / Objective-C functions that expect pointers? So you can use withUnsafeMutablePointer() without having to worry about manually managing alloc / dealloc of the vars?

Yes, it frees us from any heap allocation:

var i = 1 // stack allocated
withUnsafeMutablePointer(to: &i) { p in
    // use pointer to i
}

But it has an awkward API that would require 7 levels of nesting, because you can only hand in one var per call 😞

It's ugly and awkward, true. But never mind: the compiler optimizes out all calls to withUnsafeMutablePointer, and:

And it seems like its automatic memory management wouldn't be useful in this case, because we currently don't need to ever dealloc the vars. So there's nothing to be gained?

avoid the static methods, shared allocated pointers, and the mutex

Haha. I consider those the main feature of the parser 😄 Because it only ever allocates its working memory once, it doesn't need to alloc/dealloc anything on each run, and isn't at risk of leaving any allocations behind like Scanner does.

The static version has seven heap allocations (that's not much), but we have zero. And we have no longer any global constants, which means that we can remove the mutex: we can parse concurrently from various threads.

We'll get some regular-looking parsing code, which starts by initializing a formatter, and then having it parse a string.

let formatter = MyFormatter()
let value = formatter.value(from: string)

For DatabaseDateComponents vs DateComponents, I agree with everything there! The extra decorations from DatabaseDateComponents all make sense, and make life easier.

My only thought was to change SQLiteDateParser so that it didn't know anything about DatabaseDateComponents, and just returned DateComponents instead. Then whichever class consumes that result could inspect the components and decide which of the supported fixed formats could be used in the final DatabaseDateComponents.

But that's just rearranging the furniture without any real gain. Until there's something useful to be gained from a reshuffle, it doesn't seem worth the time?

I understand. But we need to know the parsed format, and the parser is quite competent at that.

@sobri909
Copy link
Collaborator Author

The static version has seven heap allocations (that's not much), but we have zero.

Really? I expected them to be stack, due to being static lets that request explicit allocation size at compile time...

Although having just done some googling and reading, some people are suggesting that Swift's UnsafeMutablePointer is always allocated on the heap. Hmmm...

I can't see a good reason why the compiler wouldn't optimise these specific declarations to the stack. But also can't find anyone saying that the compiler will do that. Sigh.

Although even if they are heap, they're still only allocated once, so it's still only 7 allocations for the entire app session. There's no real performance cost there. It's just academic. And given the uglier code required for withUnsafeMutablePointer(), I'd still vote against it on stack vs heap grounds 😉

I think the more promising gain of using withUnsafeMutablePointer() would be getting rid of the mutex. That could give a small performance bonus.

If I get a chance, I'll try that later this week. I've had to jump over to some other tasks now, so don't have time to play with it again just yet 😞

@sobri909
Copy link
Collaborator Author

Here's the mailing list thread that discusses stack/heap for UnsafeMutablePointer.

And now I remember that Swift's static lets are instantiated lazily. So yeah, pretty much guaranteed to be heap. Sigh. It's still only academic in terms of performance cost, but I'm disappointed anyway 😂

@groue
Copy link
Owner

groue commented Apr 18, 2018

UnsafeMutablePointer is a pointer. Whether this pointer points to heap-allocated memory or stack memory is not its business. And it will even sometimes be optimized out by the compiler (making our stack/heap questions irrelevant). Consider how it compiles the two functions below (both return 2):

// test.swift

func foo() -> Int {
    var i = 1
    withUnsafeMutablePointer(to: &i) {
        $0.pointee = 2
    }
    return i
}

func bar() -> Int {
    return 2
}
$ xcrun -sdk macosx swiftc -emit-assembly -O test.swift 
...
__T04test3fooSiyF:
	pushq	%rbp
	movq	%rsp, %rbp
	movl	$2, %eax
	popq	%rbp
	retq
...
__T04test3barSiyF:
	pushq	%rbp
	movq	%rsp, %rbp
	movl	$2, %eax
	popq	%rbp
	retq
...

They are identical :-)

However, explicit allocations are not optimized out:

func baz() -> Int {
    let p = UnsafeMutablePointer<Int>.allocate(capacity: 1)
    defer { p.deallocate(capacity: 1) }
    p.pointee = 2
    return p.pointee
}
__T04test3bazSiyF:
	pushq	%rbp
	movq	%rsp, %rbp
	movl	$8, %edi
	movl	$7, %esi
	callq	_swift_rt_swift_slowAlloc
	movq	$2, (%rax)
	movl	$8, %esi
	movl	$7, %edx
	movq	%rax, %rdi
	callq	_swift_rt_swift_slowDealloc
	movl	$2, %eax
	popq	%rbp
	retq

@groue
Copy link
Owner

groue commented Apr 20, 2018

Ha, the nested withUnsafeMutablePointer(to:) will be able to go when SE-0210, currently in review, is accepted :-)

@groue
Copy link
Owner

groue commented Apr 20, 2018

I added DateParsingTests, which parses 50000 date strings:

// New: vsscanf
testParseDateComponents: 0.34s
testParseDate:  0.482s

// Old: Scanner
testParseDateComponents: 0.868s
testParseDate:  1.08s

That's an great speed improvement :-) I don't know how to test for memory consumption, though.

@groue
Copy link
Owner

groue commented Apr 20, 2018

Since 316d287:

testParseDateComponents: 0.296s
testParseDate:  0.385s

@groue
Copy link
Owner

groue commented Apr 21, 2018

OK, @sobri909, I'm done.

@sobri909
Copy link
Collaborator Author

sobri909 commented Apr 21, 2018

Hah. I see you're going back and forth with the different ways to parse the nanoseconds, like I did 😉

Ironically I think my first attempt at it was the cleanest and most dependable (by just using %f), but it made it a major hassle to distinguish between an input of 12, 12.0, 12.00, etc. Preserving the input string's level of (or lack of) precision turned out to be much easier if the nanoseconds are parsed separately from the seconds. Sigh.

Also I'm guessing you're not a Vim user? Adding trailing whitespace to empty lines breaks Vim's ability to navigate between code blocks with { and }.

@sobri909
Copy link
Collaborator Author

I don't know how to test for memory consumption, though.

For this, I've got some super super old code sitting around somewhere, from when I had to be aware of memory consumption when doing some large image processing stuff during the iOS 4 days. I'll see if I can find it, and if it still works.

@sobri909
Copy link
Collaborator Author

Ok, I've found this method, from some code back in 2012:

- (void)calcMemory {
  mach_port_t host_port;
  mach_msg_type_number_t host_size;
  vm_size_t pagesize;

  host_port = mach_host_self();
  host_size = sizeof(vm_statistics_data_t) / sizeof(integer_t);
  host_page_size(host_port, &pagesize);

  vm_statistics_data_t vm_stat;

  if (host_statistics(host_port, HOST_VM_INFO, (host_info_t)&vm_stat, &host_size)
      != KERN_SUCCESS) {
    NSLog(@"failed to fetch vm statistics");
  }

  // stats in bytes
  natural_t mem_used = (vm_stat.active_count + vm_stat.inactive_count
      + vm_stat.wire_count) * pagesize;
  natural_t mem_free = vm_stat.free_count * pagesize;
  self.totalMemory = mem_used + mem_free;

  //NSLog(@"used: %umb free: %umb total: %dmb", mem_used / 1024 / 1024,
  //    mem_free / 1024 / 1024, self.totalMemory / 1024 / 1024);
}

Just trying to make sense of it now...

@groue
Copy link
Owner

groue commented Apr 21, 2018

Hah. I see you're going back and forth with the different ways to parse the nanoseconds, like I did 😉

Yeah! I understand you! I had to add tests because I was sweating! I just wanted to dig a little more deeper in the raw parsing to gain a few cycles (more on that below).

Also I'm guessing you're not a Vim user? Adding trailing whitespace to empty lines breaks Vim's ability to navigate between code blocks with { and }.

No, I use Xcode, and GRDB project uses Xcode conventions throughout. I guess a configurable professional editor like vim can handle that with a few mods 😅

test for memory consumption

Ok, I've found this method, from some code back in 2012

You know, we were supposed to fix the original issue, #333. If it looks fixed to you, we'll be able to merge this PR!

I'm very happy with it, because we've learned and improved many things:

  • I didn't know about vsscanf
  • Using vsscanf was an opportunity to have Date and DatabaseDateComponents adopt StatementColumnConvertible, the protocol for values that efficiently feed from raw SQLite data (like Int, Double, String, etc.)
  • The profiler shows that 50% of time is spent building the vaList that feeds vsscanf. To me this means that vsscanf itself is surely fast, but also that it's pretty expensive to call it from Swift. I conclude that eventual lightning-fast date parsing in GRDB will... not use vsscanf ;-)

@sobri909
Copy link
Collaborator Author

No, I use Xcode

Me too, but with XVim2, otherwise I'd go insane.

Xcode has a setting for "Automatically trim trailing whitespace" and "Including whitespace-only lines".

I guess a configurable professional editor like vim can handle that with a few mods

I looked for a workaround some years back, before I found that Xcode setting, then I stopped looking.

If it looks fixed to you, we'll be able to merge this PR!

Yep! I just ran my unscientific tests again on the latest commits, and memory use still looks great! Also looks like it's now faster than my original implementation too. Though hard to tell without more careful comparison. But either way, it's easily achieved the original memory use goal, and is almost three times as fast 😄

I didn't know about vsscanf

Neither. That's why that blog post got me excited. Fun new (old) toy!

@groue
Copy link
Owner

groue commented Apr 21, 2018

We're ready for a merge, then :-)

@groue groue merged commit 76e1d43 into groue:development Apr 21, 2018
groue added a commit that referenced this pull request Apr 21, 2018
@sobri909 sobri909 deleted the development branch April 23, 2018 07:16
@groue
Copy link
Owner

groue commented May 6, 2018

This PR will land in GRDB 3.0 #313

@groue groue added this to the GRDB 3.0 milestone May 6, 2018
@groue groue mentioned this pull request May 6, 2018
29 tasks
@sobri909
Copy link
Collaborator Author

sobri909 commented May 7, 2018

Great! 😄

Semi related: Apologies for not having tried 3.0 yet. I've been doing some db level refactors in LocoKit that I wanted to get stable first, before throwing any new variables into the equation. But it looks like those parts of my refactor are settled now, so I'll probably switch to 3.0 today / tomorrow. (More likely tomorrow, as I've got a flight to catch today).

groue added a commit that referenced this pull request Jun 7, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants