Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change to deterministic 64 bit type_ids for cross-binary serialisation #2949

Closed
wants to merge 2 commits into from

Conversation

dipinhora
Copy link
Contributor

The goal of this PR/change is to make cross binary serialisation between different programs built using the same Pony compiler version possible via deterministic 64 bit type_ids.

This PR has the following open questions:

  • Performance impact of runtime type information lookup that used to be based on type_id numbering and bitwise operations on type_ids but is now based on a field in the type descriptor
  • Open questions about an edge case in pony_serialise_offset (see TODO added into that function)
  • type_ids are currently based on a hash of the reach_type_t->name but should ideally be based on the AST of the type instead
  • What to do about/how to handle 32 bit platforms
  • possibly other items I haven't considered

I've marked this as DO NOT MERGE until the open questions have been addressed.


This commit changes to using 64 deterministic type_ids for 64 bit
platforms.

NOTE: 32 bit platforms inherit most of the changes except for
the deterministic 64 bit type_ids (mainly due to serialisation /
deserialisation issues due to pointers on 32 bit platforms being
only 32 bits). Fixing this requires changing the serialisation
format for 32 bit platforms to not match the in-memory format
for types. This change is left as future work due to time
constraints. Hashing the AST tree for a type to generate its
type_id has also been left as future work due to time
constraints.

Changes include:
* changing to 64 bit type_ids
* changing all compile time and run time tests based on type_id
  numbering to now use `descriptor` based checks instead (the
  `descriptor` now holds additional information accoridngly).
  This change will likely had a performance impact which I have
  not quantified.
* remove numeric size table (the info is now part of the type
  descriptor)
* hashes the reach_type_t->name to generate the 64 bit type_id for
  64 bit platforms; 32 bit platforms get sequential numbers for
  type_id's
* changes the `key` for siphash to be based on the `md5` of the
  pony version

This commit changes to using 64 deterministic type_ids for 64 bit
platforms.

NOTE: 32 bit platforms inherit most of the changes except for
the deterministic 64 bit type_ids (mainly due to serialisation /
deserialisation issues due to pointers on 32 bit platforms being
only 32 bits). Fixing this requires changing the serialisation
format for 32 bit platforms to not match the in-memory format
for types. This change is left as future work due to time
constraints. Hashing the AST tree for a type to generate its
type_id has also been left as future work due to time
constraints.

Changes include:
* changing to 64 bit type_ids
* changing all compile time and run time tests based on type_id
  numbering to now use `descriptor` based checks instead (the
  `descriptor` now holds additional information accoridngly).
  This change will likely had a performance impact which I have
  not quantified.
* remove numeric size table (the info is now part of the type
  descriptor)
* hashes the reach_type_t->name to generate the 64 bit type_id for
  64 bit platforms; 32 bit platforms get sequential numbers for
  type_id's
* changes the `key` for siphash to be based on the `md5` of the
  pony version
@dipinhora dipinhora added the do not merge This PR should not be merged at this time label Nov 28, 2018
@jemc
Copy link
Member

jemc commented Nov 28, 2018

Performance impact of runtime type information lookup that used to be based on type_id numbering and bitwise operations on type_ids but is now based on a field in the type descriptor

Instead of changing the meaning of the type_id field, have you considered adding a new field to act as the deterministic id for serialisation purposes, and leaving the current type_id logic alone?

@dipinhora
Copy link
Contributor Author

@jemc I have not. That's a good idea. I'll think about it and see how that might work.

@dipinhora dipinhora force-pushed the deterministic_typeids branch from 7f2981c to 1aa0ea3 Compare November 30, 2018 05:11
@dipinhora dipinhora force-pushed the deterministic_typeids branch from 1aa0ea3 to 3b9c8bb Compare November 30, 2018 05:15
// twiddles.
// See TODO in pony_deserialise_offset for related issue

// If we are not in the map, we are an untraced primitive. Return the
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sylvanc would you be able to provide input on when this code path might be taken? (it would also be great if you were able to review this PR overall)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The high bit is set when serialising a primitive. In pony_deserialise_offset we use that to know that we can return the constant instance for the primitive type. So this can definitely happen.

@SeanTAllen
Copy link
Member

The debug builds on Windows appear to be failing.

@chalcolith
Copy link
Member

There's an LLVM type assertion that's thrown at gendesc.c:114:

Assertion failed: CastInst::castIsValid(Instruction::BitCast, C, DstTy) && "Invalid constantexpr bitcast!", file C:\Users\Gordon\Dev\Pony\ponyc-windows-libs\build\src\llvm-3.9.1.src\lib\IR\Constants.cpp, line 1703

I'm investigating this...

Also, the changes to wscript need Python 3 support:

diff --git a/wscript b/wscript
index 373c8c5a9..95a17dbe1 100644
--- a/wscript
+++ b/wscript
@@ -37,7 +37,10 @@ with open('VERSION') as v:

# build ponyc version md5
import hashlib
-temp_md5 = hashlib.md5(VERSION).hexdigest()
+if (sys.version_info > (3,0)):
+    temp_md5 = hashlib.md5(VERSION.encode('utf-8')).hexdigest()
+else:
+    temp_md5 = hashlib.md5(VERSION).hexdigest()
VERSION_FORMATTED_MD5 = "0x" + ",0x".join([temp_md5[i:i+2] for i in range(0, len(temp_md5), 2)])

# source and build directories

@chalcolith
Copy link
Member

The reason the CI fails is that Windows usually pops up a dialog for assertions in debug mode. This is actually fixed in master, but it's fortuitous in this case :-)

@chalcolith
Copy link
Member

@dipinhora the following patch fixes Windows builds for me:

diff --git a/src/libponyc/codegen/gendesc.c b/src/libponyc/codegen/gendesc.c
index 54a06cf2d..c0393094f 100644
--- a/src/libponyc/codegen/gendesc.c
+++ b/src/libponyc/codegen/gendesc.c
@@ -539,7 +539,7 @@ void gendesc_table_lookup(compile_t* c)
  codegen_finishfun(c);

  c->desc_table_offset_lookup_fn = make_desc_ptr(desc_lkp_fn,
-    c->descriptor_offset_lookup_type);
+    c->descriptor_offset_lookup_fn);
}

static LLVMValueRef desc_field(compile_t* c, LLVMValueRef desc, int index)
diff --git a/wscript b/wscript
index 373c8c5a9..95a17dbe1 100644
--- a/wscript
+++ b/wscript
@@ -37,7 +37,10 @@ with open('VERSION') as v:

# build ponyc version md5
import hashlib
-temp_md5 = hashlib.md5(VERSION).hexdigest()
+if (sys.version_info > (3,0)):
+    temp_md5 = hashlib.md5(VERSION.encode('utf-8')).hexdigest()
+else:
+    temp_md5 = hashlib.md5(VERSION).hexdigest()
VERSION_FORMATTED_MD5 = "0x" + ",0x".join([temp_md5[i:i+2] for i in range(0, len(temp_md5), 2)])

# source and build directories

@SeanTAllen
Copy link
Member

SeanTAllen commented May 19, 2019 via email

@dipinhora
Copy link
Contributor Author

@SeanTAllen yes. Might get reopened or a new one created at some point in the future if I get back to this again.

@SeanTAllen
Copy link
Member

SeanTAllen commented May 20, 2019 via email

@dipinhora
Copy link
Contributor Author

There's the windows thing. And there are the open questions I listed in the original comment when I opened this PR.

Aside from the progress Gordon made regarding Windows, there hasn't been much progress or clarification around the open questions. Given that I am not likely to look at this in the near future, it seemed best to close the PR rather than leave it sitting stale but open.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do not merge This PR should not be merged at this time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants