-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Add formatted textual IR output #3056
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add formatted textual IR output #3056
Conversation
Use lowercase IR names. Remove raw output from `dump semantics-ir` mode, and add formatted output to the end of raw mode. Switch tests to testing raw mode so that we verify both.
Make the `dump raw-semantics-ir` output be valid Yaml again.
Undo testdata changes to make merge cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I didn't finish a first pass here -- but I left a few comments on the naming scheme and on a few parts of the code. I'll work more on finishing the review as I can.
One thing I wanted to mention is potentially writing up a document in the tree as a reference for the textual format -- would that make sense? Doesn't need to be there to start or right now, I understand wanting to let the format evolve and settle a bit before investing lots in trying to document it. Mostly wanted to write down the thought.
toolchain/semantics/semantics_ir.h
Outdated
@@ -261,10 +262,12 @@ class SemanticsIR { | |||
} | |||
|
|||
// Produces a string version of a type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment to document how in_type_context
impacts this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
GetScopeInfo(fn_scope).name = | ||
"@" + | ||
globals.AllocateName(*this, | ||
/*TODO*/ ParseTree::Node::Invalid, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a bit of context for what this should be doing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
// This should not happen in valid IR. | ||
return "<noderef " + llvm::itostr(node_id.index) + ">"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about here and elsewhere something like:
// This should not happen in valid IR. | |
return "<noderef " + llvm::itostr(node_id.index) + ">"; | |
// This should not happen in valid IR. | |
return "<invalid noderef " + llvm::itostr(node_id.index) + ">"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, went with "unexpected" rather than "invalid" because the node reference itself is valid (references a valid node), but not one that we expected to be referenced by ID.
if (allocated.insert(name).second) { | ||
return name; | ||
} | ||
name += "."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's useful to distinguish the column number from a non-location disambiguator.
I know it makes the name longer, but maybe:
name += "."; | |
name += ".c"; |
I also find c
a bit easier to separate from the number than C
, and I'd probably use l
for consistency, although there both l
and L
are hard to visually separate. Somewhat minor though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Format updated per in-person discussion.
if (allocated.insert(name).second) { | ||
return name; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the examples, I've found it a bit hard to use the line number alone when it is just the first entity on the line and subsequent ones have columns.
The reason is because I haven't seen the subsequent names yet when I see the first one without any column suffix, and the line given has a bunch of different constructs. I have to go find the next N things on that line, and eliminate those based on column number until one is remaining.
Ultimately, I think when there are multiple entries on a line, it's better to just put the column number in all of them.
The same thing comes up with multiple entries at the same column, but I didn't run into that as much reading examples -- not sure if that's just because it doesn't happen much or because it isn't as hard to disambiguate. But there also isn't anything to be done in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to using the shortest unambiguous name for an entity rather than giving a name to the first entity that wants it.
If we need to disambiguate, add the disambiguator to all versions of the name including ones that appeared earlier. Use subscript numbers as the final disambiguator. This is overall aimed to make names shorter, more visually distinctive, and more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with some non-blocking comments and some suggestions / alternatives below.
Mostly, I think this is a good initial cut. There is lots we can do to play with the format and other things to refine until it stabilizes a bit. But seems better to get an initial version into the tree and iterate / fix-forward.
operator llvm::StringRef() const { return str(); } | ||
operator llvm::Twine() const { return str(); } | ||
operator std::string() const { return str().str(); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised all of these are needed rather than just StringRef
? Extra surprised by making std::string
construction be implicit.
(Happy with either learning why these are needed or narrowing to just StringRef
, not really intended to be blocking.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These were an attempt to remove the ugly .str().str()
calls, converting explicitly first to StringRef
and then to std::string
. But in the end, we only do that twice, which I think means these don't pull their weight. Removed and added the .str().str()
s instead.
}; | ||
|
||
// All names start with the prefix. | ||
name.insert(0, prefix.data(), prefix.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the StringViewLike
.insert
not work with llvm::StringRef
? If so, that's... annoying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't realize we were using a new enough standard library for that, and it does! Thanks.
// Append location information to try to disambiguate. | ||
if (node.is_valid()) { | ||
auto token = namer.parse_tree_.node_token(node); | ||
name += ".loc"; | ||
name += llvm::itostr(namer.tokenized_buffer_.GetLineNumber(token)); | ||
add_name(); | ||
|
||
name += "_"; | ||
name += llvm::itostr(namer.tokenized_buffer_.GetColumnNumber(token)); | ||
add_name(); | ||
} | ||
|
||
// Append numbers until we find an available name. | ||
name += "."; | ||
auto name_size_without_counter = name.size(); | ||
for (int counter = 1;; ++counter) { | ||
name.resize(name_size_without_counter); | ||
name += llvm::itostr(counter); | ||
if (add_name(/*mark_ambiguous=*/false)) { | ||
return best; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would using a string-stream be clearer? It seems a tiny bit better to me, but optional if you prefer as-is:
// Append location information to try to disambiguate. | |
if (node.is_valid()) { | |
auto token = namer.parse_tree_.node_token(node); | |
name += ".loc"; | |
name += llvm::itostr(namer.tokenized_buffer_.GetLineNumber(token)); | |
add_name(); | |
name += "_"; | |
name += llvm::itostr(namer.tokenized_buffer_.GetColumnNumber(token)); | |
add_name(); | |
} | |
// Append numbers until we find an available name. | |
name += "."; | |
auto name_size_without_counter = name.size(); | |
for (int counter = 1;; ++counter) { | |
name.resize(name_size_without_counter); | |
name += llvm::itostr(counter); | |
if (add_name(/*mark_ambiguous=*/false)) { | |
return best; | |
} | |
} | |
// Append location information to try to disambiguate. | |
llvm::raw_string_ostream name_os(name); | |
if (node.is_valid()) { | |
auto token = namer.parse_tree_.node_token(node); | |
name_os << ".loc" << namer.tokenized_buffer_.GetLineNumber(token); | |
add_name(); | |
name_os << "_" << namer.tokenized_buffer_.GetColumnNumber(token); | |
add_name(); | |
} | |
// Append numbers until we find an available name. | |
name_os << "."; | |
auto name_size_without_counter = name.size(); | |
for (int counter = 1;; ++counter) { | |
name.resize(name_size_without_counter); | |
name_os << counter; | |
if (add_name(/*mark_ambiguous=*/false)) { | |
return best; | |
} | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm uncomfortable about modifying the std::string
while it's being held by the ostream (even though it looks like raw_string_ostream
doesn't cache anything about the string between output calls right now).
Done, creating the stream more frequently.
// Sequentially number all remaining values. | ||
for (auto node_id : semantics_ir_.GetNodeBlock(block_id)) { | ||
auto node = semantics_ir_.GetNode(node_id); | ||
if (node.kind() != SemanticsNodeKind::BindName && | ||
node.kind().value_kind() != SemanticsNodeValueKind::None) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious why this can't be part of the above loop over the nodes? Either adding that to the comment or merging are fine, mostly wasn't obvious to me when reading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comment.
// BindName is handled by the NodeNamer and doesn't appear in the output. | ||
template <> | ||
auto FormatInstruction<SemanticsNode::BindName>(SemanticsNodeId, | ||
SemanticsNode) -> void {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit surprised at this. If we have nodes to track the name binding, why not show them in the IR? I understand that we'll also actually use the bound name, just somewhat curious about the rationale here.
(Not blocking at all.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only purpose of a BindName
node in the IR currently is to give a name to some other node. We never reference BindName
nodes or use them for any purpose other than as an annotation to use a specific name for another node, so including them in the IR would only add noise.
I've extended the comment here to explain.
Co-authored-by: Chandler Carruth <[email protected]>
Co-authored-by: Chandler Carruth <[email protected]>
b436b51
to
ecc72e9
Compare
This test has its own directory, since #3056. We haven't added any more, so fold it into basics. Also simplify it a little using `else`, and `no_prelude`. Note, I'm not even sure how much we need this test given the `%.loc<line>_<col>`, but I feel slightly worse deleting it.
Add a textual IR format to the toolchain.
The exact details of the format are somewhat arbitrary right now, and I expect them to change as we refine the semantics IR model, but at the moment they're somewhat directly following the current structure of the IR.
Semantics tests currently test both the "raw" format, which shows the details of the representation, and the textual format, which is somewhat higher level. We may want to revisit that decision once the textual format is a bit more stable, and test only the textual format in most of these tests, but for now it seems prudent to keep both sets of tests.