Skip to content

Add formatted textual IR output #3056

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 34 commits into from
Aug 10, 2023

Conversation

zygoloid
Copy link
Contributor

@zygoloid zygoloid commented Aug 3, 2023

Add a textual IR format to the toolchain.

The exact details of the format are somewhat arbitrary right now, and I expect them to change as we refine the semantics IR model, but at the moment they're somewhat directly following the current structure of the IR.

Semantics tests currently test both the "raw" format, which shows the details of the representation, and the textual format, which is somewhat higher level. We may want to revisit that decision once the textual format is a bit more stable, and test only the textual format in most of these tests, but for now it seems prudent to keep both sets of tests.

@zygoloid zygoloid requested a review from chandlerc August 3, 2023 20:48
@zygoloid zygoloid marked this pull request as ready for review August 3, 2023 21:44
@github-actions github-actions bot requested a review from jonmeow August 3, 2023 21:44
Copy link
Contributor

@chandlerc chandlerc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I didn't finish a first pass here -- but I left a few comments on the naming scheme and on a few parts of the code. I'll work more on finishing the review as I can.

One thing I wanted to mention is potentially writing up a document in the tree as a reference for the textual format -- would that make sense? Doesn't need to be there to start or right now, I understand wanting to let the format evolve and settle a bit before investing lots in trying to document it. Mostly wanted to write down the thought.

@@ -261,10 +262,12 @@ class SemanticsIR {
}

// Produces a string version of a type.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment to document how in_type_context impacts this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

GetScopeInfo(fn_scope).name =
"@" +
globals.AllocateName(*this,
/*TODO*/ ParseTree::Node::Invalid,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a bit of context for what this should be doing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 85 to 86
// This should not happen in valid IR.
return "<noderef " + llvm::itostr(node_id.index) + ">";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about here and elsewhere something like:

Suggested change
// This should not happen in valid IR.
return "<noderef " + llvm::itostr(node_id.index) + ">";
// This should not happen in valid IR.
return "<invalid noderef " + llvm::itostr(node_id.index) + ">";

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, went with "unexpected" rather than "invalid" because the node reference itself is valid (references a valid node), but not one that we expected to be referenced by ID.

if (allocated.insert(name).second) {
return name;
}
name += ".";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's useful to distinguish the column number from a non-location disambiguator.

I know it makes the name longer, but maybe:

Suggested change
name += ".";
name += ".c";

I also find c a bit easier to separate from the number than C, and I'd probably use l for consistency, although there both l and L are hard to visually separate. Somewhat minor though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Format updated per in-person discussion.

Comment on lines 130 to 132
if (allocated.insert(name).second) {
return name;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the examples, I've found it a bit hard to use the line number alone when it is just the first entity on the line and subsequent ones have columns.

The reason is because I haven't seen the subsequent names yet when I see the first one without any column suffix, and the line given has a bunch of different constructs. I have to go find the next N things on that line, and eliminate those based on column number until one is remaining.

Ultimately, I think when there are multiple entries on a line, it's better to just put the column number in all of them.

The same thing comes up with multiple entries at the same column, but I didn't run into that as much reading examples -- not sure if that's just because it doesn't happen much or because it isn't as hard to disambiguate. But there also isn't anything to be done in that case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to using the shortest unambiguous name for an entity rather than giving a name to the first entity that wants it.

If we need to disambiguate, add the disambiguator to all versions of the
name including ones that appeared earlier.

Use subscript numbers as the final disambiguator.

This is overall aimed to make names shorter, more visually distinctive,
and more readable.
Copy link
Contributor

@chandlerc chandlerc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with some non-blocking comments and some suggestions / alternatives below.

Mostly, I think this is a good initial cut. There is lots we can do to play with the format and other things to refine until it stabilizes a bit. But seems better to get an initial version into the tree and iterate / fix-forward.

Comment on lines 138 to 140
operator llvm::StringRef() const { return str(); }
operator llvm::Twine() const { return str(); }
operator std::string() const { return str().str(); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised all of these are needed rather than just StringRef? Extra surprised by making std::string construction be implicit.

(Happy with either learning why these are needed or narrowing to just StringRef, not really intended to be blocking.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These were an attempt to remove the ugly .str().str() calls, converting explicitly first to StringRef and then to std::string. But in the end, we only do that twice, which I think means these don't pull their weight. Removed and added the .str().str()s instead.

};

// All names start with the prefix.
name.insert(0, prefix.data(), prefix.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the StringViewLike .insert not work with llvm::StringRef? If so, that's... annoying.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't realize we were using a new enough standard library for that, and it does! Thanks.

Comment on lines 200 to 221
// Append location information to try to disambiguate.
if (node.is_valid()) {
auto token = namer.parse_tree_.node_token(node);
name += ".loc";
name += llvm::itostr(namer.tokenized_buffer_.GetLineNumber(token));
add_name();

name += "_";
name += llvm::itostr(namer.tokenized_buffer_.GetColumnNumber(token));
add_name();
}

// Append numbers until we find an available name.
name += ".";
auto name_size_without_counter = name.size();
for (int counter = 1;; ++counter) {
name.resize(name_size_without_counter);
name += llvm::itostr(counter);
if (add_name(/*mark_ambiguous=*/false)) {
return best;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would using a string-stream be clearer? It seems a tiny bit better to me, but optional if you prefer as-is:

Suggested change
// Append location information to try to disambiguate.
if (node.is_valid()) {
auto token = namer.parse_tree_.node_token(node);
name += ".loc";
name += llvm::itostr(namer.tokenized_buffer_.GetLineNumber(token));
add_name();
name += "_";
name += llvm::itostr(namer.tokenized_buffer_.GetColumnNumber(token));
add_name();
}
// Append numbers until we find an available name.
name += ".";
auto name_size_without_counter = name.size();
for (int counter = 1;; ++counter) {
name.resize(name_size_without_counter);
name += llvm::itostr(counter);
if (add_name(/*mark_ambiguous=*/false)) {
return best;
}
}
// Append location information to try to disambiguate.
llvm::raw_string_ostream name_os(name);
if (node.is_valid()) {
auto token = namer.parse_tree_.node_token(node);
name_os << ".loc" << namer.tokenized_buffer_.GetLineNumber(token);
add_name();
name_os << "_" << namer.tokenized_buffer_.GetColumnNumber(token);
add_name();
}
// Append numbers until we find an available name.
name_os << ".";
auto name_size_without_counter = name.size();
for (int counter = 1;; ++counter) {
name.resize(name_size_without_counter);
name_os << counter;
if (add_name(/*mark_ambiguous=*/false)) {
return best;
}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm uncomfortable about modifying the std::string while it's being held by the ostream (even though it looks like raw_string_ostream doesn't cache anything about the string between output calls right now).

Done, creating the stream more frequently.

Comment on lines +272 to +276
// Sequentially number all remaining values.
for (auto node_id : semantics_ir_.GetNodeBlock(block_id)) {
auto node = semantics_ir_.GetNode(node_id);
if (node.kind() != SemanticsNodeKind::BindName &&
node.kind().value_kind() != SemanticsNodeValueKind::None) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why this can't be part of the above loop over the nodes? Either adding that to the comment or merging are fine, mostly wasn't obvious to me when reading.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment.

Comment on lines 433 to 436
// BindName is handled by the NodeNamer and doesn't appear in the output.
template <>
auto FormatInstruction<SemanticsNode::BindName>(SemanticsNodeId,
SemanticsNode) -> void {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit surprised at this. If we have nodes to track the name binding, why not show them in the IR? I understand that we'll also actually use the bound name, just somewhat curious about the rationale here.

(Not blocking at all.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only purpose of a BindName node in the IR currently is to give a name to some other node. We never reference BindName nodes or use them for any purpose other than as an annotation to use a specific name for another node, so including them in the IR would only add noise.

I've extended the comment here to explain.

@zygoloid zygoloid force-pushed the toolchain-ir-output branch from b436b51 to ecc72e9 Compare August 10, 2023 19:20
@zygoloid zygoloid enabled auto-merge August 10, 2023 19:21
@zygoloid zygoloid added this pull request to the merge queue Aug 10, 2023
Merged via the queue into carbon-language:trunk with commit 6cbf280 Aug 10, 2023
@zygoloid zygoloid deleted the toolchain-ir-output branch August 10, 2023 19:56
@jonmeow jonmeow mentioned this pull request May 21, 2025
github-merge-queue bot pushed a commit that referenced this pull request May 27, 2025
This test has its own directory, since #3056. We haven't added any more,
so fold it into basics. Also simplify it a little using `else`, and
`no_prelude`.

Note, I'm not even sure how much we need this test given the
`%.loc<line>_<col>`, but I feel slightly worse deleting it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants