MAINT: Adding simulated assistant role #1292
"""
Check if this is a simulated assistant response.
Simulated responses come from prepended conversations or generated
What if we are just branching off of an existing conversation? Then it's not really simulated but actually happened...
In this PR, the only place we mark messages as simulated assistant responses is when prepended_conversations is passed in to an attack. Whether or not those responses really happened in the past, within the current attack conversation they did not take place; they were supplied by the user.
I see, so this wouldn't apply to Crescendo or TAP. I predict lots of prepending when people branch from existing conversations in the GUI. In that case, marking as simulated feels like the right thing to do, BUT they should also be related (?). The problem is that someone could run a 4-turn conversation where the 4th turn is a refusal, then branch off after the 3rd turn and create a new 4th turn in a new conversation. Now the first 3 turns actually did happen and were not "simulated" (reading that as "generated"). How do we handle that? This could be very confusing.
This is nuanced, and there are pieces I hadn't thought about, like the GUI. But as is, "simulated assistant" doesn't apply to branching. IMO if the response came from the objective target as part of the attack strategy, then it should be a real assistant response (even if it's not in that conversation). Granted, this is a gray area: for the GUI it may make sense to be even more aggressive (e.g. if a user branches and sets it, maybe it should be simulated?). It's more blurred there because the GUI has no explicit "start the attack" step.
But here is why I like the distinction.
If a response came from the objective target, we want to be able to say "the objective target sent this". That way we can mark it as a finding. But if I'm a user and I set it beforehand, then it isn't a finding.
In some of the earlier PRs I'm making it easier for users to set pre-conversations. So I think this distinction will matter more.
Adds a new `simulated_assistant` role to distinguish synthetic responses (prepended conversations, `SeedPrompts`) from actual target responses. Behaves identically to assistant for API calls but is preserved in memory.

Key Changes
- MessagePiece/Message: Added `api_role` (maps `simulated_assistant` → `assistant`), `is_simulated`, and `get_role_for_storage()`. Deprecated `.role` getter.
- `mark_messages_as_simulated()` helper; `format_conversation_context()` labels simulated as "Assistant (simulated)"
- `SeedGroup.to_messages()` converts `assistant` → `simulated_assistant`
- api_role)
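
To make the distinction concrete, here is a minimal sketch of the idea, not the PR's actual code. `SketchPiece` and `mark_as_simulated` are hypothetical stand-ins; only the names `api_role`, `is_simulated`, and the `simulated_assistant` role come from the description above, and their real signatures may differ.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a message piece; NOT the project's real class.
@dataclass
class SketchPiece:
    role: str  # stored role, e.g. "user", "assistant", "simulated_assistant"
    text: str

    @property
    def api_role(self) -> str:
        # Role sent to the target API: simulated turns look like normal assistant turns.
        return "assistant" if self.role == "simulated_assistant" else self.role

    @property
    def is_simulated(self) -> bool:
        return self.role == "simulated_assistant"


def mark_as_simulated(pieces):
    """Return a new list where assistant turns are relabeled as simulated_assistant."""
    return [
        SketchPiece("simulated_assistant", p.text) if p.role == "assistant" else p
        for p in pieces
    ]


# User-supplied (prepended) history is marked as simulated before the attack starts.
prepended = mark_as_simulated([
    SketchPiece("user", "Hi"),
    SketchPiece("assistant", "Hello, how can I help?"),
])

# Turns produced during the attack keep the plain "assistant" role.
conversation = prepended + [
    SketchPiece("user", "Next question"),
    SketchPiece("assistant", "Real target reply"),
]

# The API payload cannot tell the two apart...
api_payload = [{"role": p.api_role, "content": p.text} for p in conversation]

# ...but stored roles still let us count only real target responses as findings.
findings = [p for p in conversation if p.role == "assistant"]
```

Under these assumptions the target sees an ordinary user/assistant history, while anything downstream that reads the stored role can still exclude user-supplied turns from findings.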