Tool Calling Experiment
Experimental framework for enhancing LLM capabilities through dynamic tool integration, exploring function calling patterns, tool selection strategies, and error recovery mechanisms.
Overview
The Tool Calling Experiment explores how Large Language Models can effectively leverage external tools and APIs to extend their capabilities beyond text generation. This research investigates optimal patterns for tool integration, automatic tool selection, error handling, and the design of tool interfaces that maximize LLM effectiveness.
Motivation
LLMs have inherent limitations:
- No Real-Time Data: Training data cutoff dates
- Limited Computation: Unreliable at complex or multi-step calculations
- No External Actions: Can’t interact with databases, APIs, or systems
- Hallucination Risk: May invent information
Tool calling bridges these gaps by allowing LLMs to invoke external functions, access real-time data, and perform actions in the world.
Research Questions
- Tool Selection: How do LLMs decide which tool to use?
- Parameter Extraction: Can LLMs reliably extract tool parameters from natural language?
- Error Recovery: How should systems handle tool failures?
- Chaining: Can LLMs chain multiple tools for complex tasks?
- Learning: Can LLMs learn which tools work best for which tasks?
Experimental Framework
Tool Interface Design
import math
from typing import Any, List, Optional

from pydantic import BaseModel

class ToolParameter(BaseModel):
    name: str
    type: str  # 'string', 'number', 'boolean', 'array', 'object'
    description: str
    required: bool = True
    default: Optional[Any] = None

class Tool(BaseModel):
    name: str
    description: str
    parameters: List[ToolParameter]

    def execute(self, **kwargs) -> Any:
        """Override in subclass"""
        raise NotImplementedError

class CalculatorTool(Tool):
    # Pydantic v2 requires type annotations on overridden fields
    name: str = "calculator"
    description: str = "Perform mathematical calculations"
    parameters: List[ToolParameter] = [
        ToolParameter(
            name="expression",
            type="string",
            description="Mathematical expression to evaluate (e.g., '2 + 2', 'sqrt(16)')"
        )
    ]

    def execute(self, expression: str) -> float:
        # Evaluation with a restricted namespace. Emptying __builtins__ reduces,
        # but does not eliminate, the risk of calling eval on untrusted input.
        allowed_names = {
            'sqrt': math.sqrt,
            'pow': math.pow,
            'abs': abs,
            # ... more safe functions
        }
        return eval(expression, {"__builtins__": {}}, allowed_names)

class WebSearchTool(Tool):
    name: str = "web_search"
    description: str = "Search the web for current information"
    parameters: List[ToolParameter] = [
        ToolParameter(
            name="query",
            type="string",
            description="Search query"
        ),
        ToolParameter(
            name="num_results",
            type="number",
            description="Number of results to return",
            required=False,
            default=5
        )
    ]

    def execute(self, query: str, num_results: int = 5) -> List[dict]:
        # Integration with search API
        # (search_api is a placeholder for the search client, which is not shown here)
        results = search_api.search(query, limit=num_results)
        return results
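The calculator above relies on `eval` with an emptied `__builtins__`, which is a common pattern but is not a complete sandbox. As an alternative, here is a minimal sketch that parses the expression and evaluates only whitelisted AST nodes. The function name `safe_eval` and the exact operator set are assumptions for illustration, not part of the original framework, and the sketch does not bound the cost of large exponents.

```python
import ast
import math
import operator

# Whitelisted operators and functions; anything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}
_FUNCTIONS = {"sqrt": math.sqrt, "pow": math.pow, "abs": abs}


def safe_eval(expression: str) -> float:
    """Evaluate an arithmetic expression by walking its AST (hypothetical helper)."""

    def _eval(node: ast.AST) -> float:
        # Numeric literals only; booleans and strings are rejected.
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Only direct calls to whitelisted names, with positional arguments.
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in _FUNCTIONS
            and not node.keywords
        ):
            return _FUNCTIONS[node.func.id](*[_eval(arg) for arg in node.args])
        raise ValueError(f"Unsupported expression element: {type(node).__name__}")

    tree = ast.parse(expression, mode="eval")
    return _eval(tree.body)
```

Attribute access, name lookups, and imports never reach a whitelisted branch, so an input such as `__import__('os').system('ls')` is rejected with `ValueError` instead of being executed.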
Tool Selection Strategies
Strategy 1: LLM-Based Selection
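import json
from typing import List, Optional

class LLMToolSelector:
    def __init__(self, llm, tools: List[Tool]):
        self.llm = llm
        self.tools = {t.name: t for t in tools}

    def format_tools_for_llm(self) -> str:
        # Minimal formatting (the original helper is not shown): one line per tool
        return "\n".join(
            f"- {t.name}: {t.description} (parameters: {', '.join(p.name for p in t.parameters)})"
            for t in self.tools.values()
        )

    def select_tool(self, user_query: str) -> Optional[dict]:
        # Describe available tools to LLM
        tool_descriptions = self.format_tools_for_llm()
        # Literal braces are doubled so the f-string does not treat them as placeholders
        prompt = f"""
        User query: {user_query}
        Available tools:
        {tool_descriptions}
        Which tool should be used? Respond with JSON:
        {{
            "tool_name": "name of tool",
            "parameters": {{"parameter_name": "parameter_value"}},
            "reasoning": "why this tool"
        }}
        If no tool is needed, respond with {{"tool_name": null}}
        """
        response = self.llm.generate(prompt)
        tool_call = json.loads(response)
        # Return None when the model reports that no tool is needed
        return tool_call if tool_call.get("tool_name") else None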
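Calling `json.loads` directly on the model's reply fails when the model wraps its JSON in prose or a code fence. The sketch below shows one way to make parsing more tolerant. The helper name `parse_tool_call` is hypothetical, and the approach assumes the first `{` in the reply starts the JSON object, which may not hold for every model.

```python
import json
from typing import Optional


def parse_tool_call(response: str) -> Optional[dict]:
    """Extract the first JSON object from an LLM reply (hypothetical helper).

    Returns None when the model reports that no tool is needed.
    """
    start = response.find("{")
    if start == -1:
        raise ValueError("No JSON object found in model response")
    # raw_decode parses one JSON value and ignores any trailing text,
    # such as a closing code fence or a follow-up sentence.
    tool_call, _ = json.JSONDecoder().raw_decode(response[start:])
    if not isinstance(tool_call, dict) or "tool_name" not in tool_call:
        raise ValueError("Response JSON lacks a 'tool_name' field")
    if tool_call["tool_name"] is None:
        return None
    tool_call.setdefault("parameters", {})
    return tool_call
```

Malformed JSON still raises `json.JSONDecodeError`, a subclass of `ValueError`, so a caller can catch one exception type and feed the message back to the model for a retry.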
Strategy 2: Native Function Calling
import json
from typing import List

class NativeFunctionCaller:
    """Uses OpenAI/Claude native function calling"""
    def __init__(self, llm_client, tools: List[Tool]):
        self.client = llm_client
        self.tools = {t.name: t for t in tools}

    def call_with_tools(self, user_query: str):
        # Convert tools to OpenAI function format
        # (tool_to_function_spec is a helper that is not shown here)
        functions = [self.tool_to_function_spec(t) for t in self.tools.values()]
        # Note: `functions` / `function_call` is the legacy OpenAI interface;
        # newer API versions use `tools` / `tool_choice` instead.
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": user_query}],
            functions=functions,
            function_call="auto"
        )
        message = response.choices[0].message
        if message.function_call:
            # LLM wants to call a function
            tool_name = message.function_call.name
            arguments = json.loads(message.function_call.arguments)
            # Reject tool names the model may have invented
            tool = self.tools.get(tool_name)
            if tool is None:
                raise ValueError(f"Unknown tool: {tool_name}")
            # Execute the tool
            result = tool.execute(**arguments)
            return result
        else:
            # LLM responded directly without tools
            return message.content
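The `tool_to_function_spec` helper used above is not shown in the source. The following is a minimal sketch of what such a conversion might look like, assuming the function description takes the form of a name, a description, and a JSON Schema object for the parameters. The dataclasses are stand-ins for the Pydantic models so the example runs without third-party packages.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class ToolParameter:  # stand-in for the Pydantic model defined earlier
    name: str
    type: str  # 'string', 'number', 'boolean', 'array', 'object'
    description: str
    required: bool = True
    default: Optional[Any] = None


@dataclass
class Tool:  # stand-in for the Pydantic model defined earlier
    name: str
    description: str
    parameters: List[ToolParameter] = field(default_factory=list)


def tool_to_function_spec(tool: Tool) -> dict:
    """Build a JSON-Schema-style function description from a Tool (assumed shape)."""
    properties = {}
    for p in tool.parameters:
        properties[p.name] = {"type": p.type, "description": p.description}
        if p.default is not None:
            properties[p.name]["default"] = p.default
    return {
        "name": tool.name,
        "description": tool.description,
        "parameters": {
            "type": "object",
            "properties": properties,
            # Only parameters marked required are listed here.
            "required": [p.name for p in tool.parameters if p.required],
        },
    }
```

The type names used by `ToolParameter` already match JSON Schema type names, so they pass through unchanged. A stricter schema would also need an `items` entry for `array` parameters, which this sketch omits.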
Strategy 3: Semantic Similarity
from typing import List

class SemanticToolSelector:
    """Select tools based on embedding similarity"""
    def __init__(self, tools: List[Tool], embedding_model):
        # Keyed by name so the selected tool can be looked up directly
        self.tools = {t.name: t for t in tools}
        self.embedder = embedding_model
        # Pre-compute tool embeddings
        self.tool_embeddings = {
            t.name: self.embedder.embed(t.description)
            for t in tools
        }

    def select_tool(self, user_query: str) -> Tool:
        query_embedding = self.embedder.embed(user_query)
        # Find most similar tool
        # (cosine_similarity is assumed to be defined or imported elsewhere)
        similarities = {
            name: cosine_similarity(query_embedding, tool_emb)
            for name, tool_emb in self.tool_embeddings.items()
        }
        best_tool_name = max(similarities, key=similarities.get)
        return self.tools[best_tool_name]
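The selector above depends on a `cosine_similarity` function that is not defined in the source. Here is a dependency-free sketch of it, together with a selection function that takes pre-computed embeddings. The `min_similarity` threshold is an addition for illustration: it lets the selector return no tool when nothing matches well, whereas the original always returns the best-scoring tool.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_by_similarity(
    query_embedding: Sequence[float],
    tool_embeddings: Dict[str, Sequence[float]],
    min_similarity: float = 0.0,  # hypothetical threshold, not in the original
) -> Optional[str]:
    """Return the name of the most similar tool, or None if below the threshold."""
    scores = {
        name: cosine_similarity(query_embedding, emb)
        for name, emb in tool_embeddings.items()
    }
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_similarity else None
```

With real embedding models the vectors have hundreds of dimensions, and a vectorized implementation would usually be preferable. The pure-Python version is only meant to make the computation explicit.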
Tool Chaining
import json
from typing import List

class ToolChainExecutor:
    """Execute multiple tools in sequence"""
    def __init__(self, llm, tools: List[Tool]):
        self.llm = llm
        self.tools = {t.name: t for t in tools}
        self.max_iterations = 10

    def execute_chain(self, user_query: str):
        conversation = [{"role": "user", "content": user_query}]
        # Bounded loop prevents runaway tool calling
        for _ in range(self.max_iterations):
            # LLM decides next action
            # (get_function_specs is a helper that is not shown here)
            response = self.llm.chat(
                conversation,
                functions=self.get_function_specs()
            )
            if not response.function_call:
                # LLM has final answer
                return response.content
            # Execute tool
            tool_name = response.function_call.name
            arguments = json.loads(response.function_call.arguments)
            tool = self.tools[tool_name]
            result = tool.execute(**arguments)
            # Record the model's tool request so the next turn sees its own call
            conversation.append({
                "role": "assistant",
                "content": None,
                "function_call": {
                    "name": tool_name,
                    "arguments": response.function_call.arguments
                }
            })
            # Add tool result to conversation
            conversation.append({
                "role": "function",
                "name": tool_name,
                "content": str(result)
            })
        return "Max iterations reached"
Experimental Tools
A diverse set of tools was built to cover different scenarios:
1. Data Tools
- Database Query: SQL database access
- API Fetch: REST API calls
- File Read/Write: Local file operations
2. Computation Tools
- Calculator: Mathematical expressions
- Code Executor: Run Python code safely
- Data Analysis: Pandas operations
3. External Service Tools
- Web Search: Google/Bing search
- Weather API: Current weather data
- Stock Prices: Financial data lookup
4. Action Tools
- Email Sender: Send emails
- Calendar: Schedule events
- Notification: Push notifications
5. Specialized Tools
- Image Generator: DALL-E integration
- Translator: Multi-language translation
- Summarizer: Long document summarization
Experiments & Results
Experiment 1: Tool Selection Accuracy
Setup: 100 user queries, measure which strategy selects correct tool
Results:
| Strategy | Accuracy | Latency |
|---|---|---|
| Native Function Calling | 94% | 1.2s |
| LLM-Based Selection | 89% | 2.1s |
| Semantic Similarity | 76% | 0.3s |
Insights:
- Native function calling most reliable
- Semantic similarity fastest but less accurate
- LLM-based selection good for complex cases
Experiment 2: Parameter Extraction
Setup: Test if LLM correctly extracts parameters from natural language
Example:
- User: “What’s the weather in San Francisco tomorrow?”
- Expected:
get_weather(location="San Francisco", date="tomorrow")
Results:
- Success Rate: 91% correctly extracted all parameters
- Common Errors:
  - Date parsing (e.g., “tomorrow” → actual date): 6%
  - Location ambiguity (e.g., “Paris” → Paris, France vs. Paris, Texas): 2%
  - Missing optional parameters: 1%
Improvements:
- Added parameter validation with retries
- Provided examples in tool descriptions
- Used Pydantic for type checking
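As a rough illustration of the validation step, the sketch below checks extracted parameters against a tool's parameter list and collects readable error messages that could be sent back to the model on a retry. It uses plain dictionaries instead of Pydantic so it runs with the standard library only. The function name `validate_params` and the error wording are assumptions, not the experiment's actual code.

```python
from typing import Any, Dict, List, Tuple

# Maps the type names used in tool definitions to Python types.
_TYPE_MAP = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def validate_params(
    spec: List[dict], params: Dict[str, Any]
) -> Tuple[Dict[str, Any], List[str]]:
    """Return (parameters with defaults applied, list of error messages)."""
    errors: List[str] = []
    cleaned: Dict[str, Any] = {}
    known = {p["name"] for p in spec}

    # Parameters the tool does not declare are likely hallucinated.
    for name in params:
        if name not in known:
            errors.append(f"Unknown parameter '{name}'")

    for p in spec:
        name = p["name"]
        if name not in params:
            if p.get("required", True):
                errors.append(f"Missing required parameter '{name}'")
            elif "default" in p:
                cleaned[name] = p["default"]
            continue
        value = params[name]
        # bool is a subclass of int, so reject it explicitly for numbers.
        wrong_type = not isinstance(value, _TYPE_MAP[p["type"]]) or (
            p["type"] == "number" and isinstance(value, bool)
        )
        if wrong_type:
            errors.append(
                f"Parameter '{name}' should be of type {p['type']}, "
                f"got {type(value).__name__}"
            )
        else:
            cleaned[name] = value
    return cleaned, errors
```

An empty error list means the call can proceed with `cleaned`. Otherwise the messages can be joined into the retry prompt, which is the "retry with error feedback" idea in its simplest form.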
Experiment 3: Error Recovery
Setup: Introduce tool failures, measure recovery success
Error Types:
- Invalid Parameters: 400 Bad Request
- Service Unavailable: 503 error
- Timeout: No response within 10s
- Unexpected Output: Tool returns wrong format
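The timeout case can be simulated or enforced by running the tool in a worker thread and bounding the wait. This is a minimal sketch using the standard library. The name `call_with_timeout` is hypothetical, and the source does not say how the 10-second limit was actually implemented.

```python
import concurrent.futures
from typing import Any, Callable


def call_with_timeout(fn: Callable[..., Any], timeout_s: float, **kwargs: Any) -> Any:
    """Run fn(**kwargs) in a worker thread; raise TimeoutError if it takes too long."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Normalize to the built-in exception (the same class from Python 3.11 on).
        raise TimeoutError(f"Tool did not respond within {timeout_s}s") from None
    finally:
        # Do not block on the worker: a thread that is already running
        # cannot be cancelled and will finish in the background.
        executor.shutdown(wait=False)
```

Because the worker thread keeps running after the timeout, this approach suits I/O-bound tools. A tool that must be forcibly stopped would need a separate process instead.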
Recovery Strategies:
import json

from pydantic import ValidationError

class ErrorRecoveryExecutor:
    # Assumes self.llm and self.find_fallback_tool are provided elsewhere
    def execute_with_recovery(self, tool: Tool, params: dict, retries_left: int = 2):
        try:
            result = tool.execute(**params)
            return {"success": True, "result": result}
        except ValidationError as e:
            # Parameter validation failed; stop once the retry budget is used up
            if retries_left <= 0:
                return {"success": False, "error": str(e)}
            correction_prompt = f"""
            Tool call failed with validation error:
            {str(e)}
            Original parameters: {params}
            Please provide corrected parameters as JSON.
            """
            corrected = self.llm.generate(correction_prompt)
            return self.execute_with_recovery(
                tool, json.loads(corrected), retries_left - 1
            )
        except TimeoutError:
            # Retry with different tool or inform user
            fallback = self.find_fallback_tool(tool.name)
            if fallback and retries_left > 0:
                return self.execute_with_recovery(fallback, params, retries_left - 1)
            else:
                return {"success": False, "error": "Service timeout"}
        except Exception as e:
            # Unexpected error
            return {"success": False, "error": str(e)}
Results:
- Recovery Success: 78% of failures recovered
- Retry Success: 65% succeed on second attempt
- Fallback Success: 40% succeed with alternative tool
Experiment 4: Multi-Tool Chaining
Setup: Complex tasks requiring multiple tools
Example Task: “Find the current stock price of Apple and calculate 15% of it”
Expected Chain:
stock_price(symbol="AAPL") → $178.50
calculator(expression="178.50 * 0.15") → $26.78
Results:
- Success Rate: 85% correctly chain 2 tools
- 3+ Tool Chains: 68% success rate
- Average Chain Length: 2.3 tools per complex query
Failure Modes:
- Incorrect order (calculator before stock lookup): 8%
- Missing intermediate step: 5%
- Infinite loops (LLM calls the same tool repeatedly): 2%
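The repeated-call failure mode can be caught by tracking how often an identical call has been issued within one chain. The guard below is a sketch under that assumption. The class name `RepeatedCallGuard` and the default limit of two repeats are illustrative choices, not values from the experiment.

```python
import json
from collections import Counter
from typing import Any, Dict


class RepeatedCallGuard:
    """Flags a chain that keeps issuing the same tool call with the same arguments."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self._seen: Counter = Counter()

    def allow(self, tool_name: str, arguments: Dict[str, Any]) -> bool:
        # Canonical key: sorted keys make argument order irrelevant.
        key = (tool_name, json.dumps(arguments, sort_keys=True, default=str))
        self._seen[key] += 1
        return self._seen[key] <= self.max_repeats
```

A chain executor could call `allow` before each tool execution and stop, or tell the model it is repeating itself, when it returns `False`. This complements an iteration cap, which bounds the loop but does not detect that the calls are identical.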
Experiment 5: Tool Learning
Setup: Track which tools work best for query types, learn patterns
import time
from collections import defaultdict
from typing import Optional

class ToolLearningSystem:
    def __init__(self):
        self.success_history = defaultdict(list)

    def record_success(self, query_type: str, tool_name: str, success: bool):
        self.success_history[query_type].append({
            'tool': tool_name,
            'success': success,
            'timestamp': time.time()
        })

    def get_recommended_tool(self, query_type: str) -> Optional[str]:
        history = self.success_history[query_type]
        if not history:
            return None
        # Calculate success rate per tool
        tool_stats = defaultdict(lambda: {'successes': 0, 'total': 0})
        for record in history[-50:]:  # Last 50 attempts
            tool_stats[record['tool']]['total'] += 1
            if record['success']:
                tool_stats[record['tool']]['successes'] += 1
        # Recommend tool with highest success rate
        best_tool = max(
            tool_stats.items(),
            key=lambda x: x[1]['successes'] / x[1]['total']
        )
        return best_tool[0]
Results:
- After 100 queries, recommendation accuracy: 92%
- System learned query patterns and optimal tools
- Reduced average query resolution time by 30%
Advanced Patterns
Parallel Tool Execution
import asyncio
from typing import List

async def execute_parallel_tools(self, tool_calls: List[dict]):
    """Execute multiple tools simultaneously"""
    tasks = [
        self.execute_tool_async(call['tool'], call['params'])
        for call in tool_calls
    ]
    # return_exceptions=True returns failures as values instead of raising
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return results
Use Case: “What’s the weather in New York, London, and Tokyo?”
- Execute 3 weather API calls in parallel
- About 3x faster than sequential execution, since the three calls are independent and I/O-bound
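To make the weather use case concrete, here is a self-contained sketch with a simulated weather call. `fake_weather` and `run_parallel` are hypothetical stand-ins. A real tool would make an HTTP request where the sketch sleeps.

```python
import asyncio
from typing import Any, Dict, List


async def fake_weather(city: str) -> str:
    """Stand-in for a weather API call; sleeps to simulate network latency."""
    await asyncio.sleep(0.1)
    if city == "Atlantis":
        raise ValueError(f"Unknown city: {city}")
    return f"Weather for {city}"


async def run_parallel(cities: List[str]) -> Dict[str, Any]:
    """Query all cities concurrently and report per-city success or failure."""
    tasks = [fake_weather(city) for city in cities]
    # With return_exceptions=True a failed call appears as an exception object
    # in the result list, so one bad city does not hide the other results.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return {
        city: (
            {"success": False, "error": str(result)}
            if isinstance(result, Exception)
            else {"success": True, "result": result}
        )
        for city, result in zip(cities, results)
    }


if __name__ == "__main__":
    print(asyncio.run(run_parallel(["New York", "London", "Tokyo"])))
```

Since the simulated calls overlap, total wall time is close to that of a single call, not the sum of all three. The same holds for real API calls only as long as they are independent and the provider's rate limits allow concurrent requests.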
Conditional Tool Execution
def execute_conditional(self, condition_tool, true_tool, false_tool):
    """Execute different tools based on condition"""
    condition_result = condition_tool.execute()
    if condition_result:
        return true_tool.execute()
    else:
        return false_tool.execute()
Use Case: “If stock price > $150, buy 10 shares, else wait”
Tool Composition
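from typing import Any, Callable, List

class CompositeTool(Tool):
    """Combine multiple tools into one"""
    # Declared as fields because Pydantic models reject undeclared attributes
    tools: List[Tool]
    composition_fn: Callable[..., Any]

    def execute(self, **kwargs):
        results = [tool.execute(**kwargs) for tool in self.tools]
        return self.composition_fn(*results)

# Example: Create "weather_and_outfit" tool
# (weather_tool and outfit_recommender are Tool instances defined elsewhere)
weather_outfit = CompositeTool(
    name="weather_and_outfit",
    description="Report the weather and recommend an outfit",
    parameters=[],
    tools=[weather_tool, outfit_recommender],
    composition_fn=lambda w, o: f"Weather: {w}. Recommended outfit: {o}"
)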
Best Practices Discovered
- Clear Tool Descriptions: More detailed → better selection
- Parameter Validation: Use Pydantic for type safety
- Error Messages: Helpful errors enable better retries
- Rate Limiting: Prevent runaway tool calling
- Logging: Track all tool calls for debugging
- Timeouts: Set reasonable limits for tool execution
- Fallbacks: Have alternative tools when possible
- Testing: Comprehensive test suite for each tool
Challenges & Lessons
Challenge: Hallucinated Tools
Problem: LLM invents non-existent tools
Solution: Strictly validate tool names before execution
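A minimal sketch of that validation step, assuming tools are kept in a dictionary keyed by name. The names `resolve_tool` and `UnknownToolError` are hypothetical. The error message lists close matches so it can be fed back to the model.

```python
import difflib
from typing import Any, Dict


class UnknownToolError(Exception):
    """Raised when the model requests a tool that is not registered."""


def resolve_tool(tool_name: str, registry: Dict[str, Any]) -> Any:
    """Return the registered tool, or raise with suggestions for a retry prompt."""
    if tool_name in registry:
        return registry[tool_name]
    # Close matches help the model recover from near-miss names.
    suggestions = difflib.get_close_matches(tool_name, list(registry), n=3)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
    raise UnknownToolError(
        f"Unknown tool '{tool_name}'.{hint} "
        f"Available tools: {', '.join(sorted(registry))}"
    )
```

Failing before execution keeps an invented name from ever reaching a dispatcher, and the listing of available tools gives the model what it needs to correct itself on the next turn.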
Challenge: Parameter Hallucination
Problem: LLM provides plausible but incorrect parameters
Solution: Strict schema validation, retry with error feedback
Challenge: Cost Optimization
Problem: Function calling increases API costs
Solution:
- Cache tool results when possible
- Use cheaper models for tool selection
- Batch similar tool calls
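Result caching, the first item above, can be sketched as a small time-to-live cache keyed by tool name and arguments. The class name `ToolResultCache`, the 60-second default, and the injectable clock are assumptions made so the example is testable. Which tools are safe to cache, and for how long, depends on how quickly their data goes stale.

```python
import json
import time
from typing import Any, Callable, Dict, Tuple


class ToolResultCache:
    """Caches tool results for a limited time, keyed by tool name and arguments."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[Tuple[str, str], Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(self, tool_name: str, fn: Callable[..., Any], **params: Any) -> Any:
        # Sorted keys give the same cache key regardless of argument order.
        key = (tool_name, json.dumps(params, sort_keys=True, default=str))
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            self.hits += 1
            return entry[1]
        # Missing or expired: call the tool and remember the result.
        self.misses += 1
        result = fn(**params)
        self._store[key] = (now, result)
        return result
```

Caching only makes sense for read-only tools. Action tools such as an email sender must run every time they are called.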
Challenge: Security
Problem: Malicious prompts could abuse tools
Solution:
- Whitelist allowed tools per user
- Validate all parameters
- Rate limiting and monitoring
- Sandbox execution environments
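The first and third measures can be combined in a single pre-execution check. The sketch below assumes a per-user allowlist and a sliding-window call limit. `ToolAccessPolicy`, `RateLimitExceeded`, and the parameter names are hypothetical, and a production system would also need persistent storage and audit logging.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Set


class RateLimitExceeded(Exception):
    """Raised when a user exceeds the allowed number of tool calls."""


class ToolAccessPolicy:
    """Checks a per-user allowlist and a sliding-window rate limit."""

    def __init__(self, allowed: Dict[str, Set[str]], max_calls: int,
                 window_s: float, clock: Callable[[], float] = time.monotonic):
        self.allowed = allowed
        self.max_calls = max_calls
        self.window_s = window_s
        self.clock = clock
        self._calls: Dict[str, Deque[float]] = defaultdict(deque)

    def check(self, user_id: str, tool_name: str) -> None:
        """Raise if the call is not permitted; otherwise record it."""
        if tool_name not in self.allowed.get(user_id, set()):
            raise PermissionError(f"User '{user_id}' may not call '{tool_name}'")
        now = self.clock()
        calls = self._calls[user_id]
        # Drop timestamps that have left the window.
        while calls and now - calls[0] >= self.window_s:
            calls.popleft()
        if len(calls) >= self.max_calls:
            raise RateLimitExceeded(
                f"User '{user_id}' exceeded {self.max_calls} calls per {self.window_s}s"
            )
        calls.append(now)
```

The allowlist is checked first, so a denied call does not consume the user's rate budget. Users missing from the allowlist are denied by default.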
Impact & Applications
Customer Support Bot
- Tools: Knowledge base search, order lookup, refund processing
- Results: 60% of queries resolved without human intervention
- Customer Satisfaction: 4.3/5
Data Analysis Assistant
- Tools: SQL queries, data visualization, statistical tests
- Results: 10x faster exploratory analysis
- Adoption: Used by 50+ analysts
Personal AI Assistant
- Tools: Calendar, email, reminders, web search
- Results: Saves 2 hours/day on average
- User Feedback: “Like having a personal assistant”
Future Research
- Automatic Tool Generation: LLM creates new tools as needed
- Tool Discovery: Agent explores and learns about new tools
- Multi-Modal Tools: Tools handling images, audio, video
- Collaborative Tools: Multiple agents using shared tools
- Tool Marketplaces: Ecosystem of third-party tools
- Self-Improving Tools: Tools that optimize themselves based on usage
Open Source Contributions
Repositories
- tool-calling-framework: Comprehensive toolkit
- tool-library: 50+ pre-built tools
- evaluation-suite: Benchmark for tool calling systems
Documentation
- Comprehensive API reference
- Tutorials for building custom tools
- Best practices guide
- Security considerations
Resources
Code Repository: [github.com/umberH/tool-calling-experiment]
Live Demo: [Interactive tool calling playground]