8342103: C2 compiler support for Float16 type and associated operations #21490


Closed
wants to merge 20 commits into from

Conversation

jatin-bhateja
Member

@jatin-bhateja jatin-bhateja commented Oct 14, 2024

Hi All,

This patch adds C2 compiler support for the various Float16 operations added by PR #22128.

The following is a summary of the changes included in this patch:

  1. Detection of various Float16 operations through inline expansion or pattern folding idealizations.
  2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization.
  3. Float16 SQRT and FMA operations are inferred through inline expansion, and their corresponding entry points are defined in the newly added Float16Math class.
    • These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values.
  4. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines.
  5. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to FAQs for more details.
  6. Since Float16 uses short as its storage type, raw FP16 values are always loaded into general-purpose registers, while FP16 ISA instructions generally operate on floating-point registers; the compiler therefore injects reinterpretation IR before and after Float16 operation nodes to move the short value into a floating-point register and back.
  7. New idealization routines to optimize redundant reinterpretation chains: an HF2S followed by an S2HF collapses back to the original HF value.
  8. Auto-vectorization of newly supported scalar operations.
  9. X86 and AARCH64 backend implementation for all supported intrinsics.
  10. Functional and Performance validation tests.
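Items 6 and 7 of the summary can be sketched in plain Java. This is a hedged illustration, not code from the patch: it uses `Float.floatToFloat16` / `Float.float16ToFloat` (standard since JDK 20) to model the value semantics of the S2HF/HF2S movement; in the actual IR the Reinterpret nodes are register-file moves, not value conversions.

```java
// Sketch only: models the value flow of a scalar Float16 add as C2 sees it.
// The short bits are moved into the FP domain (ReinterpretS2HF), the add is
// performed, and the result is moved back to a short (ReinterpretHF2S).
public class Fp16AddSketch {
    static short addFp16(short aBits, short bBits) {
        float a = Float.float16ToFloat(aBits);  // models ReinterpretS2HF
        float b = Float.float16ToFloat(bBits);
        float sum = a + b;                      // models AddHF
        return Float.floatToFloat16(sum);       // models ReinterpretHF2S
    }

    public static void main(String[] args) {
        short one = Float.floatToFloat16(1.0f);
        short two = Float.floatToFloat16(2.0f);
        System.out.println(Float.float16ToFloat(addFp16(one, two))); // prints 3.0
    }
}
```

The idealization in item 7 is visible in this shape: a ReinterpretHF2S immediately followed by a ReinterpretS2HF is a round trip through the integer register file and can be elided.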

Kindly review and share your feedback.

Best Regards,
Jatin


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Integration blocker

 ⚠️ Title mismatch between PR and JBS for issue JDK-8342103

Issue

  • JDK-8342103: C2 compiler support for Float16 type and associated scalar operations (Enhancement - P4) ⚠️ Title mismatch between PR and JBS.

Reviewers

Contributors

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/21490/head:pull/21490
$ git checkout pull/21490

Update a local copy of the PR:
$ git checkout pull/21490
$ git pull https://git.openjdk.org/jdk.git pull/21490/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 21490

View PR using the GUI difftool:
$ git pr show -t 21490

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/21490.diff

Using Webrev

Link to Webrev Comment

@jatin-bhateja jatin-bhateja marked this pull request as draft October 14, 2024 11:40
@bridgekeeper

bridgekeeper bot commented Oct 14, 2024

👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Oct 14, 2024

@jatin-bhateja This change is no longer ready for integration - check the PR body for details.

@openjdk

openjdk bot commented Oct 14, 2024

@jatin-bhateja The following labels will be automatically applied to this pull request:

  • core-libs
  • graal
  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@Bhavana-Kilambi
Contributor

Can we add the JMH micro-benchmark that you added recently for FP16 as well? Or has it intentionally not been included?

@Bhavana-Kilambi
Contributor

Bhavana-Kilambi commented Oct 14, 2024

Hi Jatin, could you also include the idealization tests here - test/hotspot/jtreg/compiler/c2/irTests/MulHFNodeIdealizationTests.java and ConvF2HFIdealizationTests.java in this PR?

@PaulSandoz
Member

We should move the Float16 class to jdk.incubator.vector and relevant intrinsic stuff to jdk.internal.vm.vector, and we don't need the changes to BigDecimal and BigInteger.

*/

/*
* @test
Contributor

Hi Jatin, is there any reason why these have been kept under the Float folder and not a separate Float16 folder?

@jddarcy
Member

jddarcy commented Oct 17, 2024

We should move the Float16 class to jdk.incubator.vector and relevant intrinsic stuff to jdk.internal.vm.vector, and we don't need the changes to BigDecimal and BigInteger.

To expand on that point, a few weeks back I took a look at what porting Float16 from java.lang in the lworld+fp16 branch of Valhalla to the jdk.incubator.vector package in JDK 24 would look like: the results were favorable, and the diffs are attached to JDK-8341260.

Before the work in this PR proceeds, I think the java.lang -> jdk.incubator.vector move of Float16 should occur first. This will allow leaner reviews and better API separation. I can get an updated PR of the move prepared within the next few days.

@PaulSandoz
Member

Before the work in this PR proceeds, I think the java.lang -> jdk.incubator.vector move of Float16 should occur first. This will allow leaner reviews and better API separation. I can get an updated PR of the move prepared within the next few days.

Good point, we should separate the Java changes from the intrinsic + HotSpot changes.

@jddarcy
Member

jddarcy commented Oct 18, 2024

Before the work in this PR proceeds, I think the java.lang -> jdk.incubator.vector move of Float16 should occur first. This will allow leaner reviews and better API separation. I can get an updated PR of the move prepared within the next few days.

Good point, we should separate the Java changes from the intrinsic + HotSpot changes.

PS Along those lines, see

#21574

for a non-intrinsified port of Float16 to the vector API.

@jatin-bhateja
Member Author

/contributor add @Bhavana-Kilambi

@jatin-bhateja
Member Author

/contributor add @jddarcy

@openjdk

openjdk bot commented Oct 19, 2024

@jatin-bhateja
Contributor Bhavana Kilambi <[email protected]> successfully added.

@jatin-bhateja
Member Author

/contributor add @PaulSandoz

@openjdk

openjdk bot commented Oct 19, 2024

@jatin-bhateja this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout float16_support
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Oct 19, 2024
@openjdk

openjdk bot commented Oct 19, 2024

@jatin-bhateja
Contributor Joe Darcy <[email protected]> successfully added.

@openjdk

openjdk bot commented Oct 19, 2024

@jatin-bhateja
Contributor Paul Sandoz <[email protected]> successfully added.

@jatin-bhateja
Member Author

/contributor add @rgiulietti

@openjdk

openjdk bot commented Oct 19, 2024

@jatin-bhateja
Contributor Raffaello Giulietti <[email protected]> successfully added.

@rose00
Contributor

rose00 commented Oct 20, 2024

As I noted on Joe's PR, I like the fact that the intrinsics are decoupled from the box class.

I'm now wondering if there is another simplification possible (as I claimed to Joe!) which is to reduce the number of intrinsics, ideally down to conversions (to and from HF).

For example, sqrt_float16 is an intrinsic, but I think it could be just an invisible IR node. After inlining the Java definition, you start with an IR graph that mentions sqrtD and is surrounded by conversion nodes. Then you refactor the IR graph to use sqrt_float16 directly, presumably with fewer conversions (and/or reinterprets).

Same argument for max, min, add, mul, etc.

I'm not saying the current PR is wrong, but I would like to know if it could be simplified, either now or later.

Contributor

@eme64 eme64 left a comment

Wow, thanks for tackling this!

Ok, lots of style comments.

But again:
I would have loved to see this split up into these parts:

  • Scalar
  • Scalar optimizations (value, ideal, identity)
  • Vector

This will again take many weeks to get reviewed, because it is a 3k+ line change with lots of details.

Do you have any tests for the scalar constant folding optimizations? I did not find them.

@jatin-bhateja
Member Author

jatin-bhateja commented Nov 25, 2024

Wow, thanks for tackling this!

Ok, lots of style comments.

But again: I would have loved to see this split up into these parts:

  • Scalar
  • Scalar optimizations (value, ideal, identity)
  • Vector

This will again take many weeks to get reviewed, because it is a 3k+ line change with lots of details.

Do you have any tests for the scalar constant folding optimizations? I did not find them.

Hey @eme64 ,

The patch includes IR framework-based scalar constant folding test points.

@IR(counts = {IRNode.ADD_HF, " 0 ", IRNode.REINTERPRET_S2HF, " 0 ", IRNode.REINTERPRET_HF2S, " 0 "},

Regarding vector operation inferencing, we are taking the standard route: adding new vector IR nodes and the associated VectorNode opcode-mapping routine changes, without modifying the auto-vectorization core. Each new vector operation is backed by IR framework-based tests.
https://github.com/openjdk/jdk/pull/21490/files#diff-30af2f4d6a92733f58967b0feab21ddbc58a8f1ac5d3d5660c0f60220f6fab0dR40

Our target is to get this integrated before JDK24-RDP1; your help and reviews will be highly appreciated.

Best Regards

@eme64
Contributor

eme64 commented Nov 25, 2024

I heard no argument about why you did not split this up. Please do that in the future. It is hard to review well when there is this much code. If it is really necessary, then sure. Here it does not seem necessary to deliver all at once.

The patch includes IR framework-based scalar constant folding test points.
You mention this IR test:
https://github.com/openjdk/jdk/pull/21490/files#diff-3f8786f9f62662eda4b4a5c76c01fa04534c94d870d496501bfc20434ad45579R169-R174

Here I only see the use of very trivial values. I think we need more complicated cases.

What about these:

  • Add/Sub/Mul/Div/Min/Max ... with NaN and infinity.
  • Same where it would overflow the FP16 range.
  • Negative zero tests.
  • Division by powers of 2.

It would for example be nice if you could iterate over all inputs. FP16 with 2 inputs is only 32bits, that can be iterated in just a few seconds. Then you can run the computation with constants in the interpreter, and compare to the results in compiled code.
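The exhaustive-iteration idea is cheap to realize even for a quick sanity check. As a hedged sketch (a hypothetical harness, not part of this PR), the following walks all 2^16 bit patterns and checks that every non-NaN binary16 value survives a half → float → half round trip, which holds because every binary16 value is exactly representable in binary32:

```java
public class Fp16RoundTrip {
    static int countRoundTripFailures() {
        int failures = 0;
        for (int bits = 0; bits <= 0xFFFF; bits++) {
            short s = (short) bits;
            float f = Float.float16ToFloat(s);
            if (Float.isNaN(f)) {
                continue; // NaN payloads are not required to round-trip bit-exactly
            }
            if (Float.floatToFloat16(f) != s) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println(countRoundTripFailures()); // prints 0
    }
}
```

A two-operand version is the same loop nested twice (2^32 iterations), comparing the interpreter's result against compiled code, as the comment above suggests.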

@jatin-bhateja
Member Author

jatin-bhateja commented Nov 25, 2024

I heard no argument about why you did not split this up. Please do that in the future. It is hard to review well when there is this much code. If it is really necessary, then sure. Here it does not seem necessary to deliver all at once.

The patch includes IR framework-based scalar constant folding test points.
You mention this IR test:
https://github.com/openjdk/jdk/pull/21490/files#diff-3f8786f9f62662eda4b4a5c76c01fa04534c94d870d496501bfc20434ad45579R169-R174

Here I only see the use of very trivial values. I think we need more complicated cases.

What about these:

  • Add/Sub/Mul/Div/Min/Max ... with NaN and infinity.
  • Same where it would overflow the FP16 range.
  • Negative zero tests.
  • Division by powers of 2.

It would for example be nice if you could iterate over all inputs. FP16 with 2 inputs is only 32bits, that can be iterated in just a few seconds. Then you can run the computation with constants in the interpreter, and compare to the results in compiled code.

ScalarFloat16OperationsTest.java
adds a specialized data provider that generates test vectors including special values; our functional validation covers the entire Float16 value range.

@eme64
Contributor

eme64 commented Nov 26, 2024

@jatin-bhateja

ScalarFloat16OperationsTest.java
adds a specialized data provider that generates test vectors including special values; our functional validation covers the entire Float16 value range.

Maybe I'm not making myself clear here. The test vectors will never constant fold - the values you read from an array load will always be the full range of their type, and not a constant. And you added constant folding IGVN optimizations.

So we should test both:

  • Compile-time variables: for this you can use array element loads. You have to generate the values randomly beforehand, spanning the whole Float16 value range. This I think is covered somewhat adequately.
  • Compile-time constants: for this you cannot use array element loads - they will not be constants. You have to use literals, or you can set static final int val = RANDOM.nextInt();, which will constant fold during compilation, or you can use MethodHandles.constant(int.class, 1); to get compile-time constants, that you can change and trigger recompilation with the new "constant".
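The distinction between the two input kinds can be sketched as follows (a hypothetical scaffold, not from the PR): to the JIT, an array element load is always a variable, while a static final primitive field is a compile-time constant that participates in constant folding during compilation even when its value was computed at class initialization.

```java
import java.util.Random;

public class ConstVsVarInputs {
    static final Random RANDOM = new Random(42);

    // Variable to the JIT: element values are not known at compile time.
    static final float[] VALUES = {1.5f, 2.5f};

    // Constant to the JIT: a static final primitive field folds during compilation,
    // even though its value is only fixed when the class initializes.
    static final float CONST = RANDOM.nextFloat();

    static float addVariable(int i) { return VALUES[i] + 1.0f; } // AddF stays in the IR
    static float addConstant()      { return CONST + 1.0f; }     // folds to a constant

    public static void main(String[] args) {
        System.out.println(addVariable(0)); // prints 2.5
        System.out.println(addConstant() == CONST + 1.0f); // prints true
    }
}
```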

It starts with something as simple as your constant folding of addition:

// Supplied function returns the sum of the inputs.
// This also type-checks the inputs for sanity.  Guaranteed never to
// be passed a TOP or BOTTOM type, these are filtered out by pre-check.
const Type* AddHFNode::add_ring(const Type* t0, const Type* t1) const {
  if (!t0->isa_half_float_constant() || !t1->isa_half_float_constant()) {
    return bottom_type();
  }
  return TypeH::make(t0->getf() + t1->getf());
}

Which uses this code:

const TypeH *TypeH::make(float f) {
  assert( StubRoutines::f2hf_adr() != nullptr, "");
  short hf = StubRoutines::f2hf(f);
  return (TypeH*)(new TypeH(hf))->hashcons();
}

You are doing the addition in float, and then casting back to half_float. Probably correct. But does it do the rounding correctly? Does it deal with infinity and NaN correctly? Probably, but I would like to see tests for that.
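On the rounding question: computing a single FP16 operation in binary32 and rounding the result to binary16 is believed safe here, because binary32 carries 24 significand bits, at least 2·11+2 for binary16's 11-bit significand, and under that condition double rounding is known to be innocuous for add, subtract, multiply, divide, and sqrt. Special values can still be spot-checked with a small sketch (a hypothetical test, not from the PR):

```java
public class Fp16AddSpecials {
    // Same evaluation strategy as the add_ring above: add in float, round to half.
    static short addViaFloat(short a, short b) {
        return Float.floatToFloat16(Float.float16ToFloat(a) + Float.float16ToFloat(b));
    }

    public static void main(String[] args) {
        short nan    = Float.floatToFloat16(Float.NaN);
        short posInf = Float.floatToFloat16(Float.POSITIVE_INFINITY);
        short negInf = Float.floatToFloat16(Float.NEGATIVE_INFINITY);
        short maxHf  = Float.floatToFloat16(65504.0f); // largest finite binary16 value

        if (!Float.isNaN(Float.float16ToFloat(addViaFloat(nan, maxHf))))
            throw new AssertionError("NaN must propagate");
        if (!Float.isNaN(Float.float16ToFloat(addViaFloat(posInf, negInf))))
            throw new AssertionError("+inf + -inf must be NaN");
        if (Float.float16ToFloat(addViaFloat(maxHf, maxHf)) != Float.POSITIVE_INFINITY)
            throw new AssertionError("overflow must round to +inf");
        System.out.println("special cases ok");
    }
}
```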

This is the simple stuff. Then there are more complex cases.

const Type* MinHFNode::add_ring(const Type* t0, const Type* t1) const {
  const TypeH* r0 = t0->isa_half_float_constant();
  const TypeH* r1 = t1->isa_half_float_constant();
  if (r0 == nullptr || r1 == nullptr) {
    return bottom_type();
  }

  if (r0->is_nan()) {
    return r0;
  }
  if (r1->is_nan()) {
    return r1;
  }

  float f0 = r0->getf();
  float f1 = r1->getf();
  if (f0 != 0.0f || f1 != 0.0f) {
    return f0 < f1 ? r0 : r1;
  }

  // As per the IEEE 754 specification, floating point comparison considers +ve and -ve
  // zeros as equal. Thus, perform a signed integral comparison for min value
  // detection.
  return (jint_cast(f0) < jint_cast(f1)) ? r0 : r1;
}

Is this adequately tested over the whole range of inputs? Of course the inputs have to be constant, otherwise if you only do array loads, the values are obviously variable, i.e. they would fail at the isa_half_float_constant check.
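The ±0 tie-break in MinHFNode::add_ring can be modeled in Java, using float in place of the half-float type (an illustrative sketch, not the PR's code):

```java
public class MinSignedZeroSketch {
    // Mirrors the logic above: NaN propagates, ordinary values compare
    // numerically, and the +-0.0 tie is broken by a signed comparison of the
    // raw bits, so that -0.0 (sign bit set) is treated as the smaller value.
    static float min(float f0, float f1) {
        if (Float.isNaN(f0)) return f0;
        if (Float.isNaN(f1)) return f1;
        if (f0 != 0.0f || f1 != 0.0f) {
            return f0 < f1 ? f0 : f1;
        }
        return Float.floatToRawIntBits(f0) < Float.floatToRawIntBits(f1) ? f0 : f1;
    }

    public static void main(String[] args) {
        // The ordinary compare cannot distinguish the zeros: 0.0f < -0.0f is
        // false and 0.0f == -0.0f is true, hence the bit-level tie-break.
        System.out.println(Float.floatToRawIntBits(min(0.0f, -0.0f)) != 0); // prints true (-0.0 selected)
    }
}
```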

You do have some constant folding tests like this:

    @Test
    @IR(counts = {IRNode.MIN_HF, " 0 ", IRNode.REINTERPRET_S2HF, " 0 ", IRNode.REINTERPRET_HF2S, " 0 "},
        applyIfCPUFeature = {"avx512_fp16", "true"})
    public void testMinConstantFolding() {
        assertResult(min(valueOf(1.0f), valueOf(2.0f)).floatValue(), 1.0f, "testMinConstantFolding");
        assertResult(min(valueOf(0.0f), valueOf(-0.0f)).floatValue(), -0.0f, "testMinConstantFolding");
    }

But these are only 2 examples for min. They do not cover all cases by a long shot; they cover 2 "nice" cases.

I do not think that is sufficient. Often the bugs are hiding in special cases.

Testing is really important to me. I've made the experience myself where I did not test optimizations well and later it can turn into a bug.

Comments like these do not give me much confidence:

functional validation is covering the entire Float16 value range.

What do you think @Bhavana-Kilambi @PaulSandoz ?

@eme64
Contributor

eme64 commented Nov 26, 2024

Another example where I asked if we have good tests:
[screenshot omitted: the earlier review question about test coverage]

And the test you point to is this:
[screenshot omitted: the IR test being pointed to]

It only covers a single constant divisor = 8. But what about divisors that are out of the allowed range, or not powers of 2? How do we know that you chose the bounds correctly, and are not off-by-1? And what about negative divisors?
[screenshot omitted]

@eme64
Contributor

eme64 commented Nov 26, 2024

@jatin-bhateja

I can feel the reviewer's pain

Then please do something about it!
(edit: I mean to say that it would be nice if you made the reviewer's pain as small as possible. For me, that would mean checking methodically that all your optimizations have sufficient tests covering the input range well, and splitting PRs into smaller PRs that are easier to review.)

Your comments are helpful. But they do not answer my request for better test coverage.

Yes, gtest would be helpful.
But also Java end-to-end tests are required.

set_result(_gvn.transform(new ReinterpretHF2SNode(result)));
return true;
}

Contributor

This line can be removed?

@jatin-bhateja jatin-bhateja marked this pull request as draft November 29, 2024 10:47
@openjdk openjdk bot removed the rfr Pull request is ready for review label Nov 29, 2024
}

@Benchmark
public short cosineSimilarityDoubleRoundingFP16() {
Contributor

Hi @jatin-bhateja, I understand that the PR is in draft stage, but I just wanted to put out this comment so that these changes (if needed) are included in your next patch set as well :)
This test fails with an "invalid cast type" error in this function:

fatal("unreachable. Invalid cast type.");

When it tries to compute macRes at line #238, the compiler generates a ConstraintCastNode and fails to match the half_float type. We need to add isa_half_float() as another condition in this routine and define a new Cast node for half-float.

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Dec 3, 2024
@openjdk openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Dec 3, 2024
ins_pipe( pipe_slow );
%}

instruct ReplHF_reg(vec dst, regF src, rRegI rtmp) %{
Contributor

Hi @jatin-bhateja, there are JTREG tests exercising Replicate with immediate FP16 values (the ReplHF_imm backend rule), but I noticed that there are no JTREG tests for these rules: ReplHF_reg and ReplHF_short_reg. Should those be added as well? Something like:

float f = 10.0f;
for (int i = 0; i < SIZE; ++i) {
    res[i] = Float16.fma(a[i], b[i], Float16.valueOf(f));   // f is a loop-invariant float defined outside the loop
}

@eme64
Contributor

eme64 commented Dec 10, 2024

@jatin-bhateja just ping me here if you think I should have a look at it again ;)

@jatin-bhateja
Member Author

This PR will be split into separate PRs for scalar and vectorization support, to ease the review process. Closing this PR.

@jatin-bhateja
Member Author

@jatin-bhateja just ping me here if you think I should have a look at it again ;)
Hi @eme64
As suggested, I have split this patch into separate PRs:

  • scalar operation support: pull/22754
  • vector operation support: pull/22755
