
Conversation

@danenania
Contributor

This PR contains the same changes as #5 to test that the line number annotation fix is working correctly.

Expected behavior with the new line number annotations:

  • Scanner should now correctly identify line numbers
  • mcp tool should be reported at L51-64 (not L32-44 as in #5)
  • mcp1 tool should be reported at L66-79 (not L46-59 as in #5)

The fix adds absolute line number annotations (L51:, L52:, etc.) to git diffs so the LLM can directly read them without manual calculation.
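A sketch of the annotation step described above (the function name and exact output format are assumptions, not the actual implementation):

```typescript
// Prefix each surviving line of a unified-diff hunk with its absolute
// line number in the new file, so a reader can cite e.g. L51 directly.
function annotateHunk(hunk: string): string {
  let newLineNo = 0;
  return hunk
    .split("\n")
    .map((line) => {
      const header = line.match(/^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@/);
      if (header) {
        newLineNo = parseInt(header[1], 10); // start of the new-file range
        return line; // leave hunk headers unannotated
      }
      if (line.startsWith("-")) return line; // removed lines have no new-file number
      return `L${newLineNo++}: ${line}`; // context and "+" lines advance the counter
    })
    .join("\n");
}
```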

Related: #5


@promptfoo-scanner bot left a comment

I reviewed this PR for LLM security vulnerabilities and found three high-severity issues. The new multi-turn conversation support creates a prompt injection vector that bypasses guardrails, the Information Agent has overly broad Dropbox access without approval gates, and agent outputs are logged without PII redaction. Together, these create a complete attack chain for unauthorized data access.

Minimum severity threshold for this scan: High

Comment on lines 341 to +351
  workflow_id: "wf_68ffb83dbfc88190a38103c2bb9f421003f913035dbdb131",
  },
});
- const guardrailsInputtext = workflow.input_as_text;
+ // Extract text for guardrails check - use last user message
+ const lastUserMsg = conversationHistory
+   .filter((m) => "role" in m && m.role === "user")
+   .pop();
+ const guardrailsInputtext =
+   lastUserMsg &&
+   "content" in lastUserMsg &&

🟠 High

The guardrails only check the last user message in multi-turn conversations, allowing attackers to inject malicious instructions in earlier messages or craft fake assistant messages to manipulate agent behavior. This bypasses your jailbreak and PII detection guardrails entirely.

💡 Suggested Fix

Validate all user messages through guardrails, not just the last one:

// Check ALL user messages, not just the last one
const allUserMessages = conversationHistory
  .filter((m) => "role" in m && m.role === "user")
  .map((m) => {
    if ("content" in m && Array.isArray(m.content) && m.content[0] && "text" in m.content[0]) {
      return m.content[0].text;
    }
    return "";
  })
  .filter((text) => text.length > 0);

// Validate each message
let hasTripwire = false;
let combinedResults = [];

for (const messageText of allUserMessages) {
  const guardrailsResult = await runGuardrails(
    messageText,
    jailbreakGuardrailConfig,
    context
  );
  combinedResults.push(...(guardrailsResult ?? []));

  if (guardrailsHasTripwire(guardrailsResult)) {
    hasTripwire = true;
  }
}

const guardrailsOutput = hasTripwire
  ? buildGuardrailFailOutput(combinedResults)
  : { safe_text: allUserMessages.join(" ") };

Also add validation to prevent fake assistant messages (lines 308-332) by checking message sequence structure and rejecting conversations with consecutive assistant messages.

🤖 AI Agent Prompt

There's a prompt injection vulnerability in the multi-turn conversation implementation at agent.ts:341-351. The guardrail check only examines the last user message, but earlier messages in the conversation array are not validated. This means an attacker can inject malicious instructions in earlier messages or craft fake "assistant" messages that manipulate the agent's behavior.

Investigate the data flow from workflow.messages (line 271) through convertToAgentInputItems() (lines 278-302) to the guardrail check (lines 341-351). The guardrail extraction uses .pop() to get only the last user message, but the full conversationHistory (including unvalidated earlier messages) is passed to agents at lines 378, 401, 438, and 464.

The fix should:

  1. Apply guardrails to ALL user messages in the array, not just the last one
  2. Add validation to detect fake assistant messages (check for consecutive assistant messages, ensure first message is from user)
  3. Consider whether assistant messages should be authenticated via session state to prevent forgery

Look for any existing message validation utilities in the codebase that could be extended for this purpose.
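The structural validation in step 2 could be sketched roughly as follows (the ChatMessage type and helper name are illustrative; the real message shape comes from the SDK):

```typescript
// Reject conversation histories whose turn structure could indicate
// forgery: must start with a user message, and no two assistant
// messages may appear back-to-back.
type ChatMessage = { role: "user" | "assistant"; content: unknown };

function validateMessageSequence(history: ChatMessage[]): boolean {
  if (history.length === 0) return false;
  if (history[0].role !== "user") return false; // first turn must be the user's
  for (let i = 1; i < history.length; i++) {
    // A forged assistant turn injected by the client would typically
    // follow a real assistant turn without an intervening user turn.
    if (history[i].role === "assistant" && history[i - 1].role === "assistant") {
      return false;
    }
  }
  return true;
}
```

This catches only structural forgery; authenticating assistant turns against server-side session state (step 3) is still needed to prevent content tampering.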

Comment on lines +65 to +77

const mcp1 = hostedMcpTool({
  serverLabel: "dropbox",
  connectorId: "connector_dropbox",
  serverDescription: "Knowledge Base",
  allowedTools: [
    "fetch",
    "fetch_file",
    "get_profile",
    "list_recent_files",
    "search",
    "search_files",
  ],

🟠 High

The Information Agent has unrestricted Dropbox access with no approval gate. Combined with instructions to "check all files first that you can see, use any files available to your access" (lines 235-241), this creates excessive agency that can be exploited via prompt injection to retrieve sensitive files beyond customer service content.

💡 Suggested Fix

Add an approval requirement to prevent unauthorized file access:

const mcp1 = hostedMcpTool({
  serverLabel: "dropbox",
  connectorId: "connector_dropbox",
  serverDescription: "Knowledge Base",
  allowedTools: [
    "fetch",
    "fetch_file",
    "get_profile",
    "list_recent_files",
    "search",
    "search_files",
  ],
  requireApproval: "always",  // Changed from "never"
});

Also revise the agent instructions at lines 235-241 to remove the broad "use any files available" language and explicitly restrict access to customer service documentation only.

🤖 AI Agent Prompt

The Information Agent at agent.ts:232-249 has excessive agency through unrestricted Dropbox access. The MCP tool configuration at lines 65-77 sets requireApproval: "never", and the agent instructions at lines 235-241 explicitly tell it to "check all files first that you can see, use any files available to your access."

This creates a security issue where prompt injection (see the multi-turn conversation vulnerability) can manipulate the agent to search for and retrieve sensitive files beyond customer service content. The tools include search, search_files, fetch, and fetch_file which provide broad read access.

Investigate:

  1. Whether the hostedMcpTool API supports path-based restrictions (check SDK documentation for allowedPaths or similar parameters)
  2. Whether approval gates can be granular (per-tool rather than all-or-nothing)
  3. What files are actually stored in the Dropbox account to understand the exposure scope
  4. Whether there's a way to sandbox or scope the Dropbox connector to specific directories

The fix should involve:

  • Changing requireApproval: "never" to "always" (minimum)
  • Adding path restrictions if supported by the SDK
  • Revising agent instructions to remove broad access language
  • Consider applying similar restrictions to the Return Agent's MCP tool (lines 51-63) for defense in depth

Comment on lines +392 to +394
console.log(
  `\n🔍 Classification: ${classificationAgentResult.output_parsed.classification}\n`
);

🟠 High

Agent outputs containing customer PII are logged to console without redaction. These logs are typically aggregated to centralized systems accessible by developers, ops teams, and third-party services, creating cross-user data exposure. Similar unredacted logging occurs at lines 414, 449-451, and 475-477.

💡 Suggested Fix

Remove console logging of agent outputs in production, or implement PII redaction:

function redactSensitiveData(text: string): string {
  let redacted = text;
  redacted = redacted.replace(/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g, '[EMAIL]');
  redacted = redacted.replace(/\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g, '[PHONE]');
  redacted = redacted.replace(/\b(customer_id|account_id):\s*\S+/gi, '$1: [REDACTED]');
  return redacted;
}

// Apply to all console.log calls
console.log(`\n🔍 Classification: ${redactSensitiveData(classificationAgentResult.output_parsed.classification)}\n`);

Best practice: Remove console.log statements entirely for production deployments and use a proper logging framework with PII controls for development.

🤖 AI Agent Prompt

Agent outputs are being logged to console without PII redaction at multiple locations in agent.ts:

  • Lines 392-394: Classification results
  • Line 414: Return Agent output
  • Lines 449-451: Retention Agent output
  • Lines 475-477: Information Agent output

In a customer service context, these outputs frequently contain PII (names, emails, phone numbers, account IDs). Console logs typically go to centralized log aggregation systems (Datadog, Splunk, CloudWatch, etc.) that are accessible to broader audiences than the customer data itself should be.

Additionally, the Information Agent may include sensitive content retrieved from Dropbox in its outputs, which would also be logged.

Investigate:

  1. Where do console logs go in this application's deployment (stdout to where?)
  2. Who has access to the log aggregation system
  3. Whether there's an existing PII redaction utility in the codebase
  4. Whether these console.log statements are intended for development only or also production
  5. What the organization's data protection requirements are for customer PII

The fix should either:

  • Remove console.log statements entirely (simplest for production)
  • Implement comprehensive PII redaction before logging
  • Use a structured logging framework with proper log levels and PII controls
  • Ensure log access is appropriately restricted
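As a sketch of the structured-logging option, a thin wrapper that redacts before anything reaches stdout (the wrapper name is illustrative, and the patterns mirror the regexes suggested above; a production system would use a real logging framework and broader PII coverage):

```typescript
// Redact common PII patterns before the text is ever written out.
function redactSensitiveData(text: string): string {
  return text
    .replace(/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g, "[EMAIL]")
    .replace(/\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g, "[PHONE]");
}

// Drop-in replacement for raw console.log calls: every message passes
// through redaction and is emitted as a structured JSON line.
function safeLog(level: "info" | "warn" | "error", msg: string): void {
  console.log(JSON.stringify({ level, msg: redactSensitiveData(msg) }));
}
```

Routing all agent-output logging through one such chokepoint makes it far easier to audit than redacting at each of the four call sites individually.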
