Add formatted textual IR output #3056

zygoloid · 2023-08-03T20:48:23Z

Add a textual IR format to the toolchain.

The exact details of the format are somewhat arbitrary right now, and I expect them to change as we refine the semantics IR model. For now they closely follow the current structure of the IR.

Semantics tests currently test both the "raw" format, which shows the details of the representation, and the textual format, which is somewhat higher level. We may want to revisit that decision once the textual format is a bit more stable, and test only the textual format in most of these tests, but for now it seems prudent to keep both sets of tests.

Use lowercase IR names. Remove raw output from `dump semantics-ir` mode, and add formatted output to the end of raw mode. Switch tests to testing raw mode so that we verify both.

Make the `dump raw-semantics-ir` output be valid YAML again.

Undo testdata changes to make merge cleaner.

chandlerc

Sorry I didn't finish a first pass here -- but I left a few comments on the naming scheme and on a few parts of the code. I'll work more on finishing the review as I can.

One thing I wanted to mention is potentially writing up a document in the tree as a reference for the textual format -- would that make sense? Doesn't need to be there to start or right now, I understand wanting to let the format evolve and settle a bit before investing lots in trying to document it. Mostly wanted to write down the thought.

chandlerc · 2023-08-09T07:40:58Z

toolchain/semantics/semantics_ir.h

@@ -261,10 +262,12 @@ class SemanticsIR {
  }

  // Produces a string version of a type.


Add a comment to document how in_type_context impacts this?

toolchain/parser/testdata/index/fail_malformed_expr.carbon

chandlerc · 2023-08-09T07:46:20Z

toolchain/semantics/semantics_ir_formatter.cpp

+      GetScopeInfo(fn_scope).name =
+          "@" +
+          globals.AllocateName(*this,
+                               /*TODO*/ ParseTree::Node::Invalid,


Maybe add a bit of context for what this should be doing?

chandlerc · 2023-08-09T07:48:10Z

toolchain/semantics/semantics_ir_formatter.cpp

+      // This should not happen in valid IR.
+      return "<noderef " + llvm::itostr(node_id.index) + ">";


What do you think about here and elsewhere something like:

Suggested change

-      // This should not happen in valid IR.
-      return "<noderef " + llvm::itostr(node_id.index) + ">";
+      // This should not happen in valid IR.
+      return "<invalid noderef " + llvm::itostr(node_id.index) + ">";

Done, went with "unexpected" rather than "invalid" because the node reference itself is valid (references a valid node), but not one that we expected to be referenced by ID.

chandlerc · 2023-08-09T07:50:15Z

toolchain/semantics/semantics_ir_formatter.cpp

+        if (allocated.insert(name).second) {
+          return name;
+        }
+        name += ".";


I think it's useful to distinguish the column number from a non-location disambiguator.

I know it makes the name longer, but maybe:

Suggested change

-        name += ".";
+        name += ".c";

I also find c a bit easier to separate from the number than C, and I'd probably use l for consistency, although there both l and L are hard to visually separate. Somewhat minor though.

Format updated per in-person discussion.

chandlerc · 2023-08-09T07:53:51Z

toolchain/semantics/semantics_ir_formatter.cpp

+        if (allocated.insert(name).second) {
+          return name;
+        }


Looking at the examples, I've found it a bit hard to use the line number alone when it is just the first entity on the line and subsequent ones have columns.

The reason is that I haven't seen the subsequent names yet when I see the first one without any column suffix, and the line given has a bunch of different constructs. I have to go find the next N things on that line, and eliminate them based on column number until only one remains.

Ultimately, I think when there are multiple entries on a line, it's better to just put the column number in all of them.

The same thing comes up with multiple entries at the same column, but I didn't run into that as much reading examples -- not sure if that's just because it doesn't happen much or because it isn't as hard to disambiguate. But there also isn't anything to be done in that case.

Switched to using the shortest unambiguous name for an entity rather than giving a name to the first entity that wants it.

If we need to disambiguate, add the disambiguator to all versions of the name including ones that appeared earlier. Use subscript numbers as the final disambiguator. This is overall aimed to make names shorter, more visually distinctive, and more readable.
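The naming rule described in this reply can be sketched roughly as follows. This is a minimal, standalone illustration and not the code from the PR: the function `PickShortestUnambiguousNames` and its input shape (each entity supplies its candidate names ordered from shortest to most qualified) are assumptions made for the example, and it uses a plain `.N` counter suffix where the reply mentions subscript numbers.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper, not from the PR. Each entity offers candidate names
// ordered from shortest to most qualified, for example
// {"%x", "%x.loc3", "%x.loc3_5"}. An entity gets the first candidate that no
// other entity also offers, so a clash pushes *every* clashing entity to a
// longer name, including entities that appeared earlier.
std::vector<std::string> PickShortestUnambiguousNames(
    const std::vector<std::vector<std::string>>& candidates) {
  // Count how many entities offer each candidate name. This assumes an
  // entity does not list the same candidate twice.
  std::map<std::string, int> offered;
  for (const auto& names : candidates) {
    for (const auto& name : names) {
      ++offered[name];
    }
  }

  std::set<std::string> numbered_names;
  std::vector<std::string> result;
  for (const auto& names : candidates) {
    assert(!names.empty() && "every entity needs at least one candidate");
    bool found = false;
    for (const auto& name : names) {
      if (offered.at(name) == 1) {
        result.push_back(name);
        found = true;
        break;
      }
    }
    if (found) {
      continue;
    }
    // Final disambiguator: append a counter to the most qualified candidate
    // until the result clashes with nothing else.
    for (int counter = 1;; ++counter) {
      std::string numbered = names.back() + "." + std::to_string(counter);
      if (offered.count(numbered) == 0 &&
          numbered_names.insert(numbered).second) {
        result.push_back(numbered);
        break;
      }
    }
  }
  return result;
}
```

With two entities that both offer `%x` and `%x.loc3` but differ in column, both end up with their line-and-column form, while an unrelated `%y` keeps its short name. Counting all candidates before assigning any name is what lets an earlier entity pick up the disambiguator too.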

chandlerc

LGTM with some non-blocking comments and some suggestions / alternatives below.

Mostly, I think this is a good initial cut. There is lots we can do to play with the format and other things to refine until it stabilizes a bit. But seems better to get an initial version into the tree and iterate / fix-forward.

chandlerc · 2023-08-10T07:21:32Z

toolchain/semantics/semantics_ir_formatter.cpp

+      operator llvm::StringRef() const { return str(); }
+      operator llvm::Twine() const { return str(); }
+      operator std::string() const { return str().str(); }


I'm surprised all of these are needed rather than just StringRef? Extra surprised by making std::string construction be implicit.

(Happy with either learning why these are needed or narrowing to just StringRef, not really intended to be blocking.)

These were an attempt to remove the ugly .str().str() calls, converting explicitly first to StringRef and then to std::string. But in the end, we only do that twice, which I think means these don't pull their weight. Removed and added the .str().str()s instead.

chandlerc · 2023-08-10T07:30:52Z

toolchain/semantics/semantics_ir_formatter.cpp

+      };
+
+      // All names start with the prefix.
+      name.insert(0, prefix.data(), prefix.size());


Does the StringViewLike .insert not work with llvm::StringRef? If so, that's... annoying.

I didn't realize we were using a new enough standard library for that, and it does! Thanks.
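For readers unfamiliar with the overload being discussed: since C++17, `std::string::insert` accepts any type convertible to `std::string_view` at a given position, so the pointer-and-size form is not needed. The sketch below uses `std::string_view` as a stand-in for `llvm::StringRef` so that it compiles without LLVM. `WithPrefix` is a hypothetical helper for illustration, not a function from the PR.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper: prepends `prefix` to `name`.
// std::string_view stands in for llvm::StringRef in this sketch.
std::string WithPrefix(std::string name, std::string_view prefix) {
  // Pointer-and-size form, as in the line under review.
  std::string old_form = name;
  old_form.insert(0, prefix.data(), prefix.size());

  // C++17 StringViewLike overload: insert(size_type pos, const T& t).
  name.insert(0, prefix);

  // Both spellings produce the same string.
  assert(name == old_form);
  return name;
}
```

For example, `WithPrefix("Main", "@")` yields `@Main`, matching the `"@"` prefix used for function scope names earlier in this review.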

chandlerc · 2023-08-10T07:41:38Z

toolchain/semantics/semantics_ir_formatter.cpp

+      // Append location information to try to disambiguate.
+      if (node.is_valid()) {
+        auto token = namer.parse_tree_.node_token(node);
+        name += ".loc";
+        name += llvm::itostr(namer.tokenized_buffer_.GetLineNumber(token));
+        add_name();
+
+        name += "_";
+        name += llvm::itostr(namer.tokenized_buffer_.GetColumnNumber(token));
+        add_name();
+      }
+
+      // Append numbers until we find an available name.
+      name += ".";
+      auto name_size_without_counter = name.size();
+      for (int counter = 1;; ++counter) {
+        name.resize(name_size_without_counter);
+        name += llvm::itostr(counter);
+        if (add_name(/*mark_ambiguous=*/false)) {
+          return best;
+        }
+      }


Would using a string-stream be clearer? It seems a tiny bit better to me, but optional if you prefer as-is:

Suggested change

-      // Append location information to try to disambiguate.
-      if (node.is_valid()) {
-        auto token = namer.parse_tree_.node_token(node);
-        name += ".loc";
-        name += llvm::itostr(namer.tokenized_buffer_.GetLineNumber(token));
-        add_name();
-
-        name += "_";
-        name += llvm::itostr(namer.tokenized_buffer_.GetColumnNumber(token));
-        add_name();
-      }
-
-      // Append numbers until we find an available name.
-      name += ".";
-      auto name_size_without_counter = name.size();
-      for (int counter = 1;; ++counter) {
-        name.resize(name_size_without_counter);
-        name += llvm::itostr(counter);
-        if (add_name(/*mark_ambiguous=*/false)) {
-          return best;
-        }
-      }
+      // Append location information to try to disambiguate.
+      llvm::raw_string_ostream name_os(name);
+      if (node.is_valid()) {
+        auto token = namer.parse_tree_.node_token(node);
+        name_os << ".loc" << namer.tokenized_buffer_.GetLineNumber(token);
+        add_name();
+
+        name_os << "_" << namer.tokenized_buffer_.GetColumnNumber(token);
+        add_name();
+      }
+
+      // Append numbers until we find an available name.
+      name_os << ".";
+      auto name_size_without_counter = name.size();
+      for (int counter = 1;; ++counter) {
+        name.resize(name_size_without_counter);
+        name_os << counter;
+        if (add_name(/*mark_ambiguous=*/false)) {
+          return best;
+        }
+      }

I'm uncomfortable about modifying the std::string while it's being held by the ostream (even though it looks like raw_string_ostream doesn't cache anything about the string between output calls right now).

Done, creating the stream more frequently.

chandlerc · 2023-08-10T07:44:34Z

toolchain/semantics/semantics_ir_formatter.cpp

+    // Sequentially number all remaining values.
+    for (auto node_id : semantics_ir_.GetNodeBlock(block_id)) {
+      auto node = semantics_ir_.GetNode(node_id);
+      if (node.kind() != SemanticsNodeKind::BindName &&
+          node.kind().value_kind() != SemanticsNodeValueKind::None) {


Curious why this can't be part of the above loop over the nodes? Either adding that to the comment or merging are fine, mostly wasn't obvious to me when reading.

Added comment.

chandlerc · 2023-08-10T07:52:09Z

toolchain/semantics/semantics_ir_formatter.cpp

+  // BindName is handled by the NodeNamer and doesn't appear in the output.
+  template <>
+  auto FormatInstruction<SemanticsNode::BindName>(SemanticsNodeId,
+                                                  SemanticsNode) -> void {}


I'm a bit surprised at this. If we have nodes to track the name binding, why not show them in the IR? I understand that we'll also actually use the bound name, just somewhat curious about the rationale here.

(Not blocking at all.)

The only purpose of a BindName node in the IR currently is to give a name to some other node. We never reference BindName nodes or use them for any purpose other than as an annotation to use a specific name for another node, so including them in the IR would only add noise.

I've extended the comment here to explain.

Co-authored-by: Chandler Carruth <[email protected]>

This test has its own directory, since #3056. We haven't added any more, so fold it into basics. Also simplify it a little using `else`, and `no_prelude`. Note, I'm not even sure how much we need this test given the `%.loc<line>_<col>`, but I feel slightly worse deleting it.

zygoloid added 14 commits August 2, 2023 17:16

Basic formatting.

e87441a

Node and label naming.

af8035d

Add formatting for references to builtins and for struct types.

cef0ffa

Fix formatting of struct member access.

0405e2c

Remove fallback formatting and add missing FormatArg overloads.

aabb504

Improve printing of tuples and calls.

c11bffc

Improve robustness.

60159b9

Move node name formatting up a level.

e3b4cc9

Add more vertical space after a forward declaration of a function.

9d64ff5

Fix up after rebase.

15df6f2

Remove redundant "as type" printing.

6b427e6

Clean up output format.

86a3b02

Use lowercase IR names. Remove raw output from `dump semantics-ir` mode, and add formatted output to the end of raw mode. Switch tests to testing raw mode so that we verify both.

Deduplicate names, and use name qualifiers when necessary.

4ced618

pre-commit

5aa71d5

zygoloid requested a review from chandlerc August 3, 2023 20:48

github-actions bot added the toolchain label Aug 3, 2023

Clean up node value kind representation.

fa5ce9f

zygoloid marked this pull request as ready for review August 3, 2023 21:44

github-actions bot requested a review from jonmeow August 3, 2023 21:44

zygoloid added 8 commits August 3, 2023 14:48

Merge branch 'trunk' into toolchain-ir-output

9340b93

Fix :semantics_ir_test.

ba78229

Make the `dump raw-semantics-ir` output be valid YAML again.

Use source location to identify values.

e43e822

Tweak some instruction names.

2e961c2

pre-commit

3a85110

Merge branch 'trunk' into toolchain-ir-output

e4ea8b8

Undo testdata changes to make merge cleaner.

Regenerate test expectations.

8745782

Use .L prefix to introduce a line number.

cc45c58

chandlerc reviewed Aug 9, 2023

zygoloid added 2 commits August 9, 2023 14:30

Switch to Chandler's preferred syntax.

0838615

zygoloid added 3 commits August 9, 2023 15:52

Address some review comments.

fc1ca6d

Merge branch 'trunk' into toolchain-ir-output

85c0e88

Autoupdate after merge.

36e541d

chandlerc approved these changes Aug 10, 2023

zygoloid and others added 6 commits August 10, 2023 11:12

Responses to review comments.

332bfda

Update toolchain/semantics/semantics_ir_formatter.cpp

47421d6

Co-authored-by: Chandler Carruth <[email protected]>

Update toolchain/semantics/semantics_ir_formatter.cpp

1c0181d

Co-authored-by: Chandler Carruth <[email protected]>

Merge branch 'trunk' into toolchain-ir-output

a749bca

Regenerate test expectations.

24ba19e

pre-commit

ecc72e9

zygoloid force-pushed the toolchain-ir-output branch from b436b51 to ecc72e9 Compare August 10, 2023 19:20

zygoloid enabled auto-merge August 10, 2023 19:21

zygoloid added this pull request to the merge queue Aug 10, 2023

Merged via the queue into carbon-language:trunk with commit 6cbf280 Aug 10, 2023

zygoloid deleted the toolchain-ir-output branch August 10, 2023 19:56

jonmeow mentioned this pull request May 21, 2025

Move the ir test #5510

Merged
