Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removing Integer.MAX column size limit. #3743

Merged
merged 16 commits into from
Feb 17, 2017

Conversation

akashdw
Copy link
Contributor

@akashdw akashdw commented Dec 6, 2016

This PR addresses #3513.

Note: Introduced genericIndexed version 3 instead of 2 because I see an open PR(#3069) by @pjain1 to introduce version 2.

@pjain1
Copy link
Member

pjain1 commented Dec 6, 2016

I am on vacation starting tomorrow and since #3069 has merge conflicts and has not been reviewed fully so it would make sense to make this version 2 and #3069 can be version 3 as and when it gets in so that this PR is not blocked.

@leventov
Copy link
Member

leventov commented Dec 7, 2016

I suggest #3570 to be merged first, because @jon-wei already did a lot of rebases

@@ -82,6 +132,11 @@ public void write(T objectToWrite) throws IOException
valuesOut.write(bytesToWrite);

headerOut.write(Ints.toByteArray((int) valuesOut.getCount()));
headerOutLong.write(Longs.toByteArray(valuesOut.getCount()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we avoid writing headerOutLong file in cases where its not needed? may be, we can write headerOutLong on demand at the point when fileSizeLimit is crossed (and stop writing headerOut afterwards)? in most use cases, headerOutLong would not need to be written at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For that I need to build headerOutLong from headerOut values during switch, I decided to keep the same info in headerOutLong to avoid building from headerOut and delete it if not required.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I can totally avoid writing into headerOut after the switch. Will update the pr not to write into headerOut after switch.

{
headerOut.close();
headerOutLong.close();
valuesOut.close();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these 3 lines are common in closeXXX() and can be moved to close()

headerOutLong.close();
valuesOut.close();

// revisit this check.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed it.

}
finally {
metaOut.close();
throw new ISE(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

say a druid user got this error during batch/realtime ingestion , it will continue to happen on retries ... what would be the resolution?

Copy link
Contributor Author

@akashdw akashdw Jan 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can happen if column values are greater than 2^31 bytes and each value split has only one element. but yes for this kind of extreme cases its the design limitation and retries will continue to throw this exception.

);
}
throw new IAE("Unknown version[%s]", versionFromBuffer);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't this just be read(buffer,strategy,null) with possible error msg updated in other read if necessary.?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other read(with smoosher) will again read version from the buffer, in that case it will error out.

@akashdw akashdw force-pushed the large_column_support branch from 5c2b3b8 to bd90ff3 Compare January 18, 2017 02:16
@akashdw
Copy link
Contributor Author

akashdw commented Jan 18, 2017

@himanshug Updated the PR to create headerLong only after hitting the limit, also changed the version number to use 2 instead of 3.

@akashdw akashdw closed this Jan 18, 2017
@akashdw akashdw reopened this Jan 18, 2017
@himanshug
Copy link
Contributor

👍

@leventov leventov self-assigned this Jan 25, 2017
Copy link
Member

@leventov leventov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before going on with the review I want to ask you (also @himanshug and @cheddar) - so the reason to introduce this complexity is 2 GB limitation of ByteBuffers, what about overcoming this limitation by replacing ByteBuffers with Memory, that is beneficial for other reasons also? #3716 (comment)

@@ -58,18 +59,23 @@
this.valueSupplier = valueSupplier;
}

public static CompressedVSizeIndexedV3Supplier fromByteBuffer(ByteBuffer buffer, ByteOrder order)
public static CompressedVSizeIndexedV3Supplier fromByteBuffer(
ByteBuffer buffer, ByteOrder order,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each param should be on a separate line

@@ -52,7 +51,7 @@ public static int skip(IntIterator it, int n)
* It isn't checked if the given source iterators are actually ascending, if they are not, the order of values in the
* returned iterator is undefined.
* <p>
* This is similar to what {@link MergeIterator} does with simple {@link java.util.Iterator}s.
* This is similar to what {@link io.druid.java.util.common.guava.MergeIterator} does with simple {@link java.util.Iterator}s.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line longer than 120 columns

@@ -68,7 +67,7 @@ public static IntIterator mergeAscending(List<IntIterator> iterators)
}

/**
* This class is designed mostly after {@link MergeIterator}. {@code MergeIterator} uses a priority queue of wrapper
* This class is designed mostly after {@link io.druid.java.util.common.guava.MergeIterator}. {@code MergeIterator} uses a priority queue of wrapper
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line longer than 120 columns

@@ -36,12 +37,13 @@

public BlockLayoutIndexedFloatSupplier(
int totalSize, int sizePer, ByteBuffer fromBuffer, ByteOrder order,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each param should be on a separate line

{
byte versionFromBuffer = buffer.get();

if (versionFromBuffer == VERSION) {
final int totalSize = buffer.getInt();
final int sizePer = buffer.getInt();
final CompressedObjectStrategy.CompressionStrategy compression = CompressedObjectStrategy.CompressionStrategy.forId(buffer.get());
final CompressedObjectStrategy.CompressionStrategy compression = CompressedObjectStrategy.CompressionStrategy.forId(
buffer.get());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

); should be on a separate line

public byte[] toBytes(String val)
{
if (val == null) {
return new byte[]{};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ByteArrays.EMPTY_ARRAY

if (val == null) {
return new byte[]{};
}
return io.druid.java.util.common.StringUtils.toUtf8(val);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why fully qualified?

return ordering.compare(o1, o2);
}
};
private static final byte version_one = 0x1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to keep versions on the very top to of the file

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either CAP_CASE or camelCase

}
};
private static final byte version_one = 0x1;
private static final byte version_two = 0X2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small x

this.theBuffer = buffer;
this.strategy = strategy;
this.allowReverseLookup = allowReverseLookup;
size = theBuffer.getInt();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested numElements, to match format description in the header. The same is for other fields

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indexed interface exposes int size();. Using variable name size in parity with this function makes more sense to me.
Also some of the exceptions thrown from this class are also using size in the description.
throw new IAE(String.format("Index[%s] >= size[%s]", index, size));

@akashdw akashdw force-pushed the large_column_support branch from 5ab8861 to d43c39e Compare January 27, 2017 21:34
@akashdw
Copy link
Contributor Author

akashdw commented Jan 27, 2017

@leventov Thanks for reviewing this PR. I updated the PR for code formatting related changes.
Memory changes are not ready yet and can be introduced into GenericIndexed with a new version.

@himanshug
Copy link
Contributor

@leventov yes, it is to introduce larger columns with existing ByteBuffer based GenericIndexed. Replacing ByteBuffer with Memory would achieve same but that would take some time before getting materialized.
Also, this work internally predates the discussions on replacing ByteBuffer with Memory.

@cheddar any other thoughts?

@leventov
Copy link
Member

@akashdw @himanshug If we add a more complex format now, we have to support it for very long or forever, despite this complexity could be avoided if we replace ByteBuffer with Memory. I suggest to not merge this PR and push "ByteBuffer -> Memory" project more actively. See #3892

@niketh
Copy link
Contributor

niketh commented Jan 30, 2017

@leventov I am already working on migrating from ByteBuffer to Memory with @leerho and @cheddar . It is going to take sometime before this task is completed. Also we need to arrive at a consensus on a few things.

@akashdw
Copy link
Contributor Author

akashdw commented Jan 30, 2017

@leventov I understand replacing ByteBuffer with memory would achieve the same but memory changes are not ready yet. @niketh and @leerho are working on the memory changes and might take some time to replace ByteBuffer with memory. I think these changes are completely independent of memory changes and allow us to create large columns using ByteBuffer. As I mentioned earlier memory changes can be introduced later with a new GenericIndexed version 3.

@leventov
Copy link
Member

@niketh thanks for letting know.

@akashdw IMO being able to go beyond 2 GB limit one or two public Druid releases earlier doesn't worth introducing complexity, that should then be supported for years. At Yahoo you need it and you are the author of the patch, so you can apply it and use private builds (if you don't do this already)

@cheddar
Copy link
Contributor

cheddar commented Jan 30, 2017

@leventov I'm not sure I understand the complexity here. There is an explicit need to support large columns, we have users who are actually running into this problem and you are asking us to not merge it because of a PR that isn't even ready yet?

We chose to do these even though we knew that the Memory interface would eventually come because we do not believe the Memory interface to be a simple and quick change. It will introduce API instability, will have performance implications and will require a good amount of back and forth.

I don't think it's a good idea to set a precedent of not accepting patches simply because you, personally, don't need the fix. And the argument about support can be used to say that we should not accept any PR whatsoever. This change is incremental, opt-in only and can clearly and cleanly be migrated forward. I don't think there is any reason to block the PR.

@cheddar
Copy link
Contributor

cheddar commented Jan 30, 2017

@leventov if you are volunteering to complete the move to the Memory interface as your highest priority project and believe you can get it implemented within the next week or two, then I would reconsider and would be happy to let you

push "ByteBuffer -> Memory" project more actively

@leventov
Copy link
Member

@cheddar the fact that other people need this, besides Yahoo, changes the picture. I'm not blocking it, if any other committer not from Yahoo is OK with it.

@cheddar
Copy link
Contributor

cheddar commented Jan 31, 2017

@leventov even if it is only Yahoo at this point in time, I don't see why that's a reason to block it. Btw, as I've calmed down a little bit. Let me add that this PR introduces an interface change to ComplexMetricSerde interface that is actually very important and good. It provides the ability for a ComplexMetricSerde to control its own serialization, opening the door to many optimizations. Without doing the work to make this PR happen, we wouldn't have realized how to actually adjust that interface to make things work. Throwing away this PR essentially throws that away and pulls us backwards in that regard.

In general, Druid has and continues to maintain versioned binary storage to enable us to iterate and make changes as needed. This has served us well and is the mechanism by which we handle complexity and support for features that might initially seem orthogonal to what others also need out of the project. But, in the end, we are running into this problem because we are pushing the limits of Druid and need to resolve those issues. The changes are all within the normal mechanisms that Druid uses to make incremental changes and they are purely opt-in.

If there is some technical reason that you think we are making changes to the code that will have long-lasting, negative implications, please bring them up and we can try to address them. But if the push back is merely "it's more code and more code increases the maintenance burden," I do not believe that argument is valid.

Copy link
Member

@leventov leventov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stopped on GenericIndexedWriter.java

* Meta File:
* byte 1: version (0x2)
* byte 2 == 0x1 =>; allowReverseLookup
* bytes 3-6: numberOfElementsPerValueFile expressed as power of 2. That means all the value files contains same number of items
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line longer than 120 cols

@Override
public T get(int index)
{
if (index < 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 143-148 replaceable with Preconditions.checkElementIndex()

Copy link
Contributor Author

@akashdw akashdw Jan 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't be able throw exception as IAE, My intent here is throw IAE exception with custom error message.

Copy link
Member

@leventov leventov Jan 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error message of IndexOutOfBoundsException produced by Preconditions is just as informative.

Copy link
Contributor Author

@akashdw akashdw Jan 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of primary thing I have considered while making these changes is not to change existing behavior of V1 GenericIndexed. Current V1 in master is throwing IAE with custom error message and I'm keeping the same behavior. V2 just follows the same convention as V1. May be there can be another PR to replace IAE with preconditions checks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, then could you please extract this block (repeated several times in GenericIndexed) as a helper method in GenericIndexed and leave a comment on this method, explaining why this method is used instead of Preconditions?

numElementsPerValueFile = size;

int indexOffset = theBuffer.position();
int valuesOffset = theBuffer.position() + (size << 2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested * Ints.BYTES

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't <<2 more optimal?

Copy link
Member

@leventov leventov Jan 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, Hotspot JIT (if not javac) replaces multiplication or division by a static compile-time constant, which is a power of 2, with shift.

Even if it wasn't so, this constructor is not a hot place where every CPU cycle counts

Copy link
Contributor Author

@akashdw akashdw Jan 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Ints.BYTES is more readable, will change the code to use it.

throw new IAE(String.format("Index[%s] >= size[%s]", index, size));
}

ByteBuffer copyHeaderBuffer = headerBuffer.asReadOnlyBuffer();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested not to create this copy and use a few positioned getInt() calls below on the original headerBuffer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the java doc I don't see any copying of data while creating ReadOnlyBuffer, it just ensures write protection on original buffer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I suggest to remove line 150 and then

        if (index == 0) {
          startOffset = 4;
          endOffset = headerBuffer.getInt(0);
        } else {
          int headerPosition = (index - 1) * Ints.BYTES;
          startOffset = headerBuffer.getInt(headerPosition) + 4;
          endOffset = headerBuffer.getInt(headerPosition + 4);
        }
        return _get(valueBuffer.asReadOnlyBuffer(), startOffset, endOffset);

2 lines less in total, less garbage created.

Copy link
Contributor Author

@akashdw akashdw Jan 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how will you ensure BufferedIndexed.reads are write protected? BufferedIndexed contract is to fetch element and guarantees no modification in original ByteBuffer. Any bad implementation of an ObjectStrategy can corrupt the original ByteBuffer. I think protecting ByteBuffer write has more benefits than less garbage collection

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh my mistake. sure, suggested code ensures write protection. I'll update the PR

final ByteBuffer copyBuffer = valueBuffers.get(fileNum).asReadOnlyBuffer();
final ByteBuffer copyHeaderBuffer = headerBuffer.asReadOnlyBuffer();

if (index < 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaceable with Preconditions.checkElementIndex()

Copy link
Contributor Author

@akashdw akashdw Jan 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't be able throw exception as IAE, My intent here is throw IAE exception with custom error message.

int fileNum = index / numElementsPerValueFile;
final ByteBuffer copyBuffer = copyValueBuffers.get(fileNum);

if (index < 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't be able throw exception as IAE, My intent here is throw IAE exception with custom error message.

final int startOffset;
final int endOffset;

if (index % numElementsPerValueFile == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block is identical to some block in BufferedIndexed above, consider extracting a method

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept both the blocks separately because version 1 does a check of if (index == 0) vs V2 check is if (index % numElementsPerValueFile == 0), we wanted to avoid % operation for version 1 indexed.

}
};
} else {
final List<ByteBuffer> copyValueBuffers = Lists.newArrayList(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating an empty array list and filling it in a for-each loop over valueBuffers takes 4 lines, this block takes 13 lines

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

throws IOException
{
while (numBytesToPutInFile > 0) {
int bytesRead = is.read(buffer, 0, Math.min(buffer.length, (int) numBytesToPutInFile));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsafe cast to int, may lead to exception, suggested Ints.saturatedCast(numBytesToPutInFile)

{
while (numBytesToPutInFile > 0) {
int bytesRead = is.read(buffer, 0, Math.min(buffer.length, (int) numBytesToPutInFile));
smooshChannel.write((ByteBuffer) ByteBuffer.wrap(buffer).limit(bytesRead));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be checked bytesRead is not -1

@akashdw akashdw force-pushed the large_column_support branch from f1b4689 to b13ea5d Compare February 14, 2017 00:34
@cheddar
Copy link
Contributor

cheddar commented Feb 14, 2017

@leventov I don't think we are going to make the change to introduce LargeGenericIndexed separate from GenericIndexed. As I mentioned, we originally considered two methods of doing this already and chose the current because GenericIndexed already has the abstraction that we needed in place and introducing another class would basically just be a bunch of C&P. I think there is likely room for further optimization in a subsequent PR, but don't think we should delay this PR anymore on this.

Is that acceptable or are you actively veto'ing this PR?

@leventov
Copy link
Member

@cheddar @akashdw could you please go through my arguments pro separation:

  • Instead of having two big-endian impls now, and adding two native-endian later, you can do LargeGenericIndexed native-endian now and don't introduce any new version of LargeGenericIndexed in the future
  • Remove the unused 4-byte field for each stored object in LargeGenericIndexed
  • If a subsequent PR won't be merged before 0.10.0, the format introduced in this PR is sealed.
  • Separation of GenericIndexed and LargeGenericIndexed doesn't seem to require much development, it's a regrouping of existing code (and making fixes in LargeGenericIndexed suggested above). I don't see a risk of not doing it before 0.10.0
  • I don't see much code is really shared in GenericIndexed now. fromIterable() supports version 1 only. Two separate constructors without shared code. Most other methods just check the format version one way or another and then start executing separate, unshared code for different formats. The only method that is really shared is indexOf(), it could be easily converted to a static accepting any Indexed and called from both GenericIndexed and LargeGenericIndexed.

@cheddar
Copy link
Contributor

cheddar commented Feb 14, 2017

Instead of having two big-endian impls now, and adding two native-endian later, you can do LargeGenericIndexed native-endian now and don't introduce any new version of LargeGenericIndexed in the future

We will likely need to reimplement for the changes for Memory anyway, so spending time right now isn't really that useful. It seems more prudent to me to just introduce things as we have them right now and then in the Memory future, we introduce a new binary version that is all native endian and handles all and any sizes and migrate everything to just use that.

Remove the unused 4-byte field for each stored object in LargeGenericIndexed

This is not a significant storage cost (Parag had a PR out to resolve that and introduce a new version of Generic Indexed that sat without anyone looking at it for months). Introducing that change would also actually have performance implications on indexing, because we currently do not re-process the payloads when we notice that the column needs to become large. In order to switch to that implementation, we would need to start reprocessing then. Once we are in the Memory future and these things collapse together, it will be trivial to remove that.

If a subsequent PR won't be merged before 0.10.0, the format introduced in this PR is sealed.

That's exactly why we version our binary formats. In the future we will introduce another version and move things towards that.

Separation of GenericIndexed and LargeGenericIndexed doesn't seem to require much development, it's a regrouping of existing code (and making fixes in LargeGenericIndexed suggested above). I don't see a risk of not doing it before 0.10.0
I don't see much code is really shared in GenericIndexed now. fromIterable() supports version 1 only. Two separate constructors without shared code. Most other methods just check the format version one way or another and then start executing separate, unshared code for different formats. The only method that is really shared is indexOf(), it could be easily converted to a static accepting any Indexed and called from both GenericIndexed and LargeGenericIndexed.

The code path that I'm talking about is the main contract of the GenericIndexed object, where the get() code is being reused and we are taking advantage of the abstraction that is already in place to switch between the implementations.

If your issues are just generally with how GenericIndexed works and you want to reimplement that, I'm totally down with that and see that as something that can have potential benefits, but I do not think it should be coupled with this PR.

The point of this PR is to introduce the ability to handle columns larger than 2GB in as minimally invasive of a way as possible. In order to do that we also ended up needing to expand the abstraction for ComplexColumnSerde, which we did and has significant benefits for the extension point. We are not trying to innovate on the actual algorithm of GenericIndexed nor fix anything else other than allow for larger columns.

@leventov
Copy link
Member

leventov commented Feb 15, 2017

Ok, I agree not to change LargeGenericIndexed format (not removing the size field and changing ordering)

The code path that I'm talking about is the main contract of the GenericIndexed object, where the get() code is being reused and we are taking advantage of the abstraction that is already in place to switch between the implementations.

This abstraction, BufferedIndexed, could be made a top-level package-private class, copying three fields from GenericIndexed/LargeGenericIndexed: size, strategy, and allowReverseLookup. Both GenericIndexed and LargeGenericIndexed could use it. None other code in the currently proposed version of GenericIndexed is shared between versions.

On the normal GenericIndexed.get(), size(), getClazz() and iterator() path BufferedIndexed is not needed at all, as I noted here: #3743 (comment) (I don't require this to be a part of this PR, can fix it myself later).

If your issues are just generally with how GenericIndexed works and you want to reimplement that, I'm totally down with that and see that as something that can have potential benefits, but I do not think it should be coupled with this PR.

What I currently displeased with is that GenericIndexed looks like a perfect mix of two separate sets of logic, that makes GenericIndexed unnecessarily complex. There should be two different classes.

Your argument against separation is that it's going to be "large C&P of the code", but I actually don't see any code that should be copied.

@leventov
Copy link
Member

leventov commented Feb 15, 2017

Or e. g. AbstractGenericIndexed with three fields: size, strategy and allowsReverseLookup, inner package-private BufferedIndexed class, and static constants. GenericIndexed and LargeGenericIndexed extend it.

@cheddar
Copy link
Contributor

cheddar commented Feb 15, 2017

@leventov if I can repeat what I think you are saying. You are annoyed that GenericIndexed has fields for "both" sets of use cases. I.e. it is maintaining both

private final List<ByteBuffer> valueBuffers;
private final ByteBuffer headerBuffer;
private int logBaseTwoOfElementsPerValueFile;

and

private ByteBuffer theBuffer;

If we can consolidate the fields into just one of those two sets for both uses, is that a reasonable compromise?

Separating it out into separate classes seems like all it is doing is shuffling code around and I'm not sure it's really enabling or solving something other than aligning with a specific aesthetic. While it might be nicer to look at each of those pieces individually, it will also mean that the currently relatively self-contained GenericIndexed will become an amalgam of a number of disparate classes. At the end of the Memory changes, I would hope to have just one primary implementation in use again, which would hopefully just be a new implementation of Indexed without any of this complexity at all. I don't really see a significant benefit to making changes now given that it's not an abstraction that I expect to continue being important in the future nor do I think that the abstraction introduced would enable us to do something new and wonderful.

@leventov
Copy link
Member

@cheddar I'm annoyed not so much with the fields, but with the fact that the logic of the cases is interspersed. I'm ok with a compromise that the logic just inside GenericIndexed is regrouped so that logic for v1 and v2 is grouped together. After all I think better I do it myself in a subsequent PR, along with some other fixes like proposed here: #3743 (comment).

I'll take another (hopefully final) look at the code in this PR later this week

@fjy
Copy link
Contributor

fjy commented Feb 16, 2017

@leventov can we finish this off? This is one of the largest PRs blocking for the release

@leventov
Copy link
Member

@fjy will do before the end of this week

@@ -81,7 +154,17 @@ public void write(T objectToWrite) throws IOException
valuesOut.write(Ints.toByteArray(bytesToWrite.length));
valuesOut.write(bytesToWrite);

headerOut.write(Ints.toByteArray((int) valuesOut.getCount()));
if (!requireMultipleFiles) {
writeIntValueToOutputStream(buf, (int) valuesOut.getCount(), headerOut);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkedCast()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why github is not showing this line as outdated.
19b69f3#diff-37ba90c2e6d6b833ce38fc1ce76eecc3R158

// to current offset.
if ((pos & (numberOfElementsPerValueFile - 1)) == 0) {
relativeRefBytes = currentNumBytes;
numberOfElementsPerValueFile = 1 << bagSizePower;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line should be removed

}
currentNumBytes = Long.reverseBytes(headerFile.readLong());
relativeNumBytes = currentNumBytes - relativeRefBytes;
helperBuffer.putInt(0, Ints.checkedCast(relativeNumBytes));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and the next line are replaceable with

writeIntValueToOutputStream(helperBuffer, Ints.checkedCast(relativeNumBytes), finalHeaderOut);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants