Tool Calling Experiment

Overview

The Tool Calling Experiment explores how Large Language Models can effectively leverage external tools and APIs to extend their capabilities beyond text generation. This research investigates optimal patterns for tool integration, automatic tool selection, error handling, and the design of tool interfaces that maximize LLM effectiveness.

Motivation

LLMs have inherent limitations:

No Real-Time Data: Knowledge stops at the training data cutoff date
Limited Computation: Can’t reliably perform complex calculations
No External Actions: Can’t interact with databases, APIs, or other systems on their own
Hallucination Risk: May invent information

Tool calling bridges these gaps by allowing LLMs to invoke external functions, access real-time data, and perform actions in the world.

Research Questions

Tool Selection: How do LLMs decide which tool to use?
Parameter Extraction: Can LLMs reliably extract tool parameters from natural language?
Error Recovery: How should systems handle tool failures?
Chaining: Can LLMs chain multiple tools for complex tasks?
Learning: Can LLMs learn which tools work best for which tasks?

Experimental Framework

Tool Interface Design

import math
from typing import Any, List, Optional

from pydantic import BaseModel, Field

class ToolParameter(BaseModel):
    name: str
    type: str  # 'string', 'number', 'boolean', 'array', 'object'
    description: str
    required: bool = True
    default: Optional[Any] = None

class Tool(BaseModel):
    name: str
    description: str
    parameters: List[ToolParameter]

    def execute(self, **kwargs) -> Any:
        """Override in subclass"""
        raise NotImplementedError

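class CalculatorTool(Tool):
    # Pydantic needs type annotations when a subclass overrides field defaults
    name: str = "calculator"
    description: str = "Perform mathematical calculations"
    parameters: List[ToolParameter] = [
        ToolParameter(
            name="expression",
            type="string",
            description="Mathematical expression to evaluate (e.g., '2 + 2', 'sqrt(16)')"
        )
    ]

    def execute(self, expression: str) -> float:
        # Evaluate with a restricted namespace. Emptying __builtins__ narrows
        # what an expression can reach, but eval() is not a real sandbox, so
        # untrusted input still needs stricter handling.
        allowed_names = {
            'sqrt': math.sqrt,
            'pow': math.pow,
            'abs': abs,
            # ... more safe functions
        }
        return eval(expression, {"__builtins__": {}}, allowed_names)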

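class WebSearchTool(Tool):
    # Pydantic needs type annotations when a subclass overrides field defaults
    name: str = "web_search"
    description: str = "Search the web for current information"
    parameters: List[ToolParameter] = [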
        ToolParameter(
            name="query",
            type="string",
            description="Search query"
        ),
        ToolParameter(
            name="num_results",
            type="number",
            description="Number of results to return",
            required=False,
            default=5
        )
    ]

    def execute(self, query: str, num_results: int = 5) -> List[dict]:
        # Integration with a search API. `search_api` stands in for whichever
        # client is used and is not defined in this snippet.
        results = search_api.search(query, limit=num_results)
        return results

Tool Selection Strategies

Strategy 1: LLM-Based Selection

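import json

class LLMToolSelector:
    def __init__(self, llm, tools: List[Tool]):
        self.llm = llm
        self.tools = {t.name: t for t in tools}

    def select_tool(self, user_query: str) -> Optional[dict]:
        # Describe available tools to LLM
        tool_descriptions = self.format_tools_for_llm()

        # Literal braces are doubled so the f-string does not treat them
        # as placeholders
        prompt = f"""
        User query: {user_query}

        Available tools:
        {tool_descriptions}

        Which tool should be used? Respond with JSON:
        {{
            "tool_name": "name of tool",
            "parameters": {{"parameter_name": "parameter_value"}},
            "reasoning": "why this tool"
        }}

        If no tool is needed, respond with {{"tool_name": null}}
        """

        response = self.llm.generate(prompt)
        try:
            tool_call = json.loads(response)
        except json.JSONDecodeError:
            # The model did not return valid JSON; treat as "no tool selected"
            return None

        # Return None for "no tool needed" and for unregistered tool names
        if tool_call.get("tool_name") not in self.tools:
            return None
        return tool_call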

Strategy 2: Native Function Calling

import json

class NativeFunctionCaller:
    """Uses native function calling (shown here with the OpenAI chat completions API)"""

    def __init__(self, llm_client, tools: List[Tool]):
        self.client = llm_client
        # Index tools by name so the requested tool can be looked up directly
        self.tools = {t.name: t for t in tools}

    def call_with_tools(self, user_query: str):
        # Convert tools to OpenAI function format
        functions = [self.tool_to_function_spec(t) for t in self.tools.values()]

        # Note: `functions` / `function_call` are the legacy OpenAI parameters;
        # newer API versions use `tools` / `tool_choice` instead
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": user_query}],
            functions=functions,
            function_call="auto"
        )

        message = response.choices[0].message

        if message.function_call:
            # LLM wants to call a function
            tool_name = message.function_call.name
            arguments = json.loads(message.function_call.arguments)

            # Guard against tool names the model invented
            tool = self.tools.get(tool_name)
            if tool is None:
                raise ValueError(f"Unknown tool requested: {tool_name}")

            # Execute the tool
            result = tool.execute(**arguments)

            return result
        else:
            # LLM responded directly without tools
            return message.content
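
The conversion step (`tool_to_function_spec`) is not shown in the snippet above. One possible sketch follows, assuming the target is the JSON Schema layout used for function definitions (`name`, `description`, and a `parameters` object with `properties` and `required`). The dataclasses are stand-ins for the Pydantic `ToolParameter` and `Tool` models so the example runs on its own; the function body is an illustration, not the project's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ToolParameter:
    # Stand-in for the Pydantic ToolParameter model defined earlier
    name: str
    type: str
    description: str
    required: bool = True
    default: Optional[Any] = None


@dataclass
class Tool:
    # Stand-in for the Pydantic Tool model defined earlier
    name: str
    description: str
    parameters: List[ToolParameter] = field(default_factory=list)


def tool_to_function_spec(tool: Tool) -> Dict[str, Any]:
    """Build a JSON Schema style function spec from a Tool definition."""
    properties: Dict[str, Any] = {}
    required: List[str] = []
    for param in tool.parameters:
        schema: Dict[str, Any] = {
            "type": param.type,
            "description": param.description,
        }
        if param.default is not None:
            schema["default"] = param.default
        properties[param.name] = schema
        if param.required:
            required.append(param.name)
    return {
        "name": tool.name,
        "description": tool.description,
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


web_search = Tool(
    name="web_search",
    description="Search the web for current information",
    parameters=[
        ToolParameter(name="query", type="string", description="Search query"),
        ToolParameter(
            name="num_results",
            type="number",
            description="Number of results to return",
            required=False,
            default=5,
        ),
    ],
)
spec = tool_to_function_spec(web_search)
```

Keeping the spec derived from the same `ToolParameter` list that drives execution means the description the model sees and the arguments the tool accepts cannot drift apart.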

Strategy 3: Semantic Similarity

class SemanticToolSelector:
    """Select tools based on embedding similarity"""

    def __init__(self, tools: List[Tool], embedding_model):
        # Index tools by name so the best match can be looked up directly
        self.tools = {t.name: t for t in tools}
        self.embedder = embedding_model

        # Pre-compute tool embeddings
        self.tool_embeddings = {
            t.name: self.embedder.embed(t.description)
            for t in tools
        }

    def select_tool(self, user_query: str) -> Tool:
        query_embedding = self.embedder.embed(user_query)

        # Find most similar tool
        similarities = {
            name: cosine_similarity(query_embedding, tool_emb)
            for name, tool_emb in self.tool_embeddings.items()
        }

        best_tool_name = max(similarities, key=similarities.get)
        return self.tools[best_tool_name]
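
The selector relies on a `cosine_similarity` helper that the snippet does not define. A minimal standard-library sketch is given here, assuming embeddings are plain sequences of floats; real embedding models usually return arrays, where a NumPy version would be more idiomatic. The toy vectors are made up for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (invented values, not real model output)
query = [1.0, 0.0, 1.0]
tool_embeddings = {
    "calculator": [1.0, 0.0, 0.9],
    "web_search": [0.0, 1.0, 0.1],
}
scores = {
    name: cosine_similarity(query, emb)
    for name, emb in tool_embeddings.items()
}
best = max(scores, key=scores.get)
```

Because `max` always returns some tool, a minimum-score threshold may be worth adding so that queries needing no tool are not forced onto the nearest match.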

Tool Chaining

class ToolChainExecutor:
    """Execute multiple tools in sequence"""

    def __init__(self, llm, tools: List[Tool]):
        self.llm = llm
        self.tools = {t.name: t for t in tools}
        self.max_iterations = 10

    def execute_chain(self, user_query: str):
        conversation = [{"role": "user", "content": user_query}]
        iteration = 0

        while iteration < self.max_iterations:
            # LLM decides next action
            response = self.llm.chat(
                conversation,
                functions=self.get_function_specs()
            )

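            if response.function_call:
                # Execute tool
                tool_name = response.function_call.name
                arguments = json.loads(response.function_call.arguments)

                # Record the model's tool request so the result that follows
                # has its context in the conversation
                conversation.append({
                    "role": "assistant",
                    "content": None,
                    "function_call": {
                        "name": tool_name,
                        "arguments": response.function_call.arguments
                    }
                })

                tool = self.tools.get(tool_name)
                if tool is None:
                    # Unknown tool name: report it back instead of raising KeyError
                    result = f"Error: unknown tool '{tool_name}'"
                else:
                    result = tool.execute(**arguments)

                # Add tool result to conversation
                conversation.append({
                    "role": "function",
                    "name": tool_name,
                    "content": str(result)
                })

                iteration += 1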

            else:
                # LLM has final answer
                return response.content

        return "Max iterations reached"

Experimental Tools

A diverse set of tools was created to cover different scenarios:

1. Data Tools

Database Query: SQL database access
API Fetch: REST API calls
File Read/Write: Local file operations

2. Computation Tools

Calculator: Mathematical expressions
Code Executor: Run Python code safely
Data Analysis: Pandas operations

3. External Service Tools

Web Search: Google/Bing search
Weather API: Current weather data
Stock Prices: Financial data lookup

4. Action Tools

Email Sender: Send emails
Calendar: Schedule events
Notification: Push notifications

5. Specialized Tools

Image Generator: DALL-E integration
Translator: Multi-language translation
Summarizer: Long document summarization

Experiments & Results

Experiment 1: Tool Selection Accuracy

Setup: 100 user queries; measure how often each strategy selects the correct tool

Results:

Strategy	Accuracy	Latency
Native Function Calling	94%	1.2s
LLM-Based Selection	89%	2.1s
Semantic Similarity	76%	0.3s

Insights:

Native function calling most reliable
Semantic similarity fastest but less accurate
LLM-based selection good for complex cases

Experiment 2: Parameter Extraction

Setup: Test if LLM correctly extracts parameters from natural language

Example:

User: “What’s the weather in San Francisco tomorrow?”
Expected: get_weather(location="San Francisco", date="tomorrow")

Results:

Success Rate: 91% correctly extracted all parameters
Common Errors:
- Date parsing (e.g., “tomorrow” → actual date): 6%
- Location ambiguity (e.g., “Paris” → Paris, France vs. Paris, Texas): 2%
- Missing optional parameters: 1%

Improvements:

Added parameter validation with retries
Provided examples in tool descriptions
Used Pydantic for type checking
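
To illustrate validation with retry feedback, here is a standard-library stand-in for the Pydantic checks mentioned above. It validates extracted parameters against specs that use the `ToolParameter` type names and returns messages that could be passed back to the model. The dict-based spec format and the `validate_params` name are assumptions for this sketch.

```python
from typing import Any, Dict, List

# Maps the type names used by ToolParameter to Python types
TYPE_MAP = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def validate_params(specs: List[Dict[str, Any]], params: Dict[str, Any]) -> List[str]:
    """Return human-readable validation errors (an empty list means valid)."""
    errors: List[str] = []
    known = {spec["name"] for spec in specs}
    for spec in specs:
        name = spec["name"]
        if name not in params:
            if spec.get("required", True):
                errors.append(f"Missing required parameter '{name}'")
            continue
        value = params[name]
        expected = TYPE_MAP[spec["type"]]
        # bool is a subclass of int, so reject it explicitly for non-boolean types
        if isinstance(value, bool) and spec["type"] != "boolean":
            errors.append(f"Parameter '{name}' must be {spec['type']}, got boolean")
        elif not isinstance(value, expected):
            errors.append(
                f"Parameter '{name}' must be {spec['type']}, got {type(value).__name__}"
            )
    for name in params:
        if name not in known:
            errors.append(f"Unknown parameter '{name}'")
    return errors


specs = [
    {"name": "query", "type": "string"},
    {"name": "num_results", "type": "number", "required": False},
]
errors = validate_params(specs, {"num_results": "five"})
for message in errors:
    print(message)
```

Returning every error at once, instead of stopping at the first, gives the model enough information to fix the whole call in a single retry.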

Experiment 3: Error Recovery

Setup: Introduce tool failures, measure recovery success

Error Types:

Invalid Parameters: 400 Bad Request
Service Unavailable: 503 error
Timeout: No response within 10s
Unexpected Output: Tool returns wrong format

Recovery Strategies:

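import json

from pydantic import ValidationError

class ErrorRecoveryExecutor:
    def __init__(self, llm, max_retries: int = 2):
        self.llm = llm
        # Cap on recovery attempts so a persistent failure cannot recurse forever
        self.max_retries = max_retries

    def execute_with_recovery(self, tool: Tool, params: dict, attempt: int = 0):
        try:
            result = tool.execute(**params)
            return {"success": True, "result": result}

        except ValidationError as e:
            # Parameter validation failed; give up once the retry budget is spent
            if attempt >= self.max_retries:
                return {"success": False, "error": str(e)}

            correction_prompt = f"""
            Tool call failed with validation error:
            {str(e)}

            Original parameters: {params}

            Please provide corrected parameters as JSON.
            """
            corrected = self.llm.generate(correction_prompt)
            try:
                corrected_params = json.loads(corrected)
            except json.JSONDecodeError:
                return {"success": False, "error": "Correction was not valid JSON"}
            return self.execute_with_recovery(tool, corrected_params, attempt + 1)

        except TimeoutError:
            # Retry with different tool or inform user
            # (find_fallback_tool is not shown in this snippet)
            fallback = self.find_fallback_tool(tool.name)
            if fallback and attempt < self.max_retries:
                return self.execute_with_recovery(fallback, params, attempt + 1)
            else:
                return {"success": False, "error": "Service timeout"}

        except Exception as e:
            # Unexpected error
            return {"success": False, "error": str(e)}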

Results:

Recovery Success: 78% of failures recovered
Retry Success: 65% succeed on second attempt
Fallback Success: 40% succeed with alternative tool

Experiment 4: Multi-Tool Chaining

Setup: Complex tasks requiring multiple tools

Example Task: “Find the current stock price of Apple and calculate 15% of it”

Expected Chain:

stock_price(symbol="AAPL") → $178.50
calculator(expression="178.50 * 0.15") → $26.78

Results:

Success Rate: 85% correctly chain 2 tools
3+ Tool Chains: 68% success rate
Average Chain Length: 2.3 tools per complex query

Failure Modes:

Incorrect order (calculator before stock lookup): 8%
Missing intermediate step: 5%
Infinite loops (LLM calls the same tool repeatedly): 2%
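
The repeated-call failure mode can be caught with a small guard that counts identical tool calls within one chain. This is a sketch under assumptions: `RepeatedCallGuard` and its `max_repeats` threshold are hypothetical and were not part of the experiment's executor.

```python
import json
from collections import Counter
from typing import Any, Dict


class RepeatedCallGuard:
    """Flag a chain that issues the same tool call too many times."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self._seen: Counter = Counter()

    def should_stop(self, tool_name: str, arguments: Dict[str, Any]) -> bool:
        # Serialize with sorted keys so argument order does not matter
        key = (tool_name, json.dumps(arguments, sort_keys=True, default=str))
        self._seen[key] += 1
        return self._seen[key] > self.max_repeats


guard = RepeatedCallGuard(max_repeats=2)
calls = [("stock_price", {"symbol": "AAPL"})] * 3
decisions = [guard.should_stop(name, args) for name, args in calls]
print(decisions)  # [False, False, True]
```

A check like this complements an iteration cap: the cap bounds total work, while the guard stops a stuck chain before the cap is reached.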

Experiment 5: Tool Learning

Setup: Track which tools work best for query types, learn patterns

import time
from collections import defaultdict
from typing import Optional

class ToolLearningSystem:
    def __init__(self):
        self.success_history = defaultdict(list)

    def record_success(self, query_type: str, tool_name: str, success: bool):
        self.success_history[query_type].append({
            'tool': tool_name,
            'success': success,
            'timestamp': time.time()
        })

    def get_recommended_tool(self, query_type: str) -> Optional[str]:
        # .get() avoids creating an empty entry for unseen query types
        history = self.success_history.get(query_type)
        if not history:
            return None

        # Calculate success rate per tool
        tool_stats = defaultdict(lambda: {'successes': 0, 'total': 0})
        for record in history[-50:]:  # Last 50 attempts
            tool_stats[record['tool']]['total'] += 1
            if record['success']:
                tool_stats[record['tool']]['successes'] += 1

        # Recommend tool with highest success rate
        best_tool = max(
            tool_stats.items(),
            key=lambda x: x[1]['successes'] / x[1]['total']
        )
        return best_tool[0]

Results:

After 100 queries, recommendation accuracy: 92%
System learned query patterns and optimal tools
Reduced average query resolution time by 30%

Advanced Patterns

Parallel Tool Execution

async def execute_parallel_tools(self, tool_calls: List[dict]):
    """Execute multiple tools simultaneously"""

    tasks = [
        self.execute_tool_async(call['tool'], call['params'])
        for call in tool_calls
    ]

    results = await asyncio.gather(*tasks, return_exceptions=True)

    return results

Use Case: “What’s the weather in New York, London, and Tokyo?”

Execute 3 weather API calls in parallel
Up to 3x faster than sequential execution when the calls have similar latency
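
A self-contained version of the parallel pattern, for illustration only: `fetch_weather` is a hypothetical tool whose `asyncio.sleep` simulates network latency, and the city "Atlantis" is used to show how a single failure is reported.

```python
import asyncio
import time
from typing import Any, Dict, List


async def fetch_weather(city: str) -> Dict[str, Any]:
    """Hypothetical weather tool; the sleep simulates network latency."""
    await asyncio.sleep(0.1)
    if city == "Atlantis":
        raise ValueError(f"Unknown city: {city}")
    return {"city": city, "status": "ok"}


async def execute_parallel(cities: List[str]) -> List[Any]:
    tasks = [fetch_weather(city) for city in cities]
    # With return_exceptions=True, exceptions are returned as items in the
    # result list instead of being raised to the caller
    return await asyncio.gather(*tasks, return_exceptions=True)


start = time.perf_counter()
results = asyncio.run(execute_parallel(["New York", "London", "Atlantis"]))
elapsed = time.perf_counter() - start

for result in results:
    if isinstance(result, Exception):
        print(f"failed: {result}")
    else:
        print(f"ok: {result['city']}")
print(f"elapsed: {elapsed:.2f}s")
```

Results come back in the same order as the input calls, so each one can be matched to its request even when some entries are exceptions.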

Conditional Tool Execution

def execute_conditional(self, condition_tool, true_tool, false_tool):
    """Execute different tools based on condition"""

    condition_result = condition_tool.execute()

    if condition_result:
        return true_tool.execute()
    else:
        return false_tool.execute()

Use Case: “If stock price > $150, buy 10 shares, else wait”

Tool Composition

from typing import Callable

class CompositeTool(Tool):
    """Combine multiple tools into one"""

    # Declared as Pydantic fields so that BaseModel handles initialization
    tools: List[Tool]
    composition_fn: Callable[..., Any]

    def execute(self, **kwargs):
        # Every sub-tool receives the same keyword arguments
        results = [tool.execute(**kwargs) for tool in self.tools]
        return self.composition_fn(*results)

# Example: Create "weather_and_outfit" tool
weather_outfit = CompositeTool(
    name="weather_and_outfit",
    description="Report the weather and recommend an outfit",
    parameters=[],
    tools=[weather_tool, outfit_recommender],
    composition_fn=lambda w, o: f"Weather: {w}. Recommended outfit: {o}"
)

Best Practices Discovered

Clear Tool Descriptions: More detailed → better selection
Parameter Validation: Use Pydantic for type safety
Error Messages: Helpful errors enable better retries
Rate Limiting: Prevent runaway tool calling
Logging: Track all tool calls for debugging
Timeouts: Set reasonable limits for tool execution
Fallbacks: Have alternative tools when possible
Testing: Comprehensive test suite for each tool

Challenges & Lessons

Challenge: Hallucinated Tools

Problem: LLM invents non-existent tools

Solution: Strictly validate tool names before execution
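
One way to validate tool names is to resolve them against a registry and return a hint the model can act on when the lookup fails. In this sketch, `resolve_tool`, `UnknownToolError`, and the two registry entries are hypothetical; `difflib.get_close_matches` from the standard library supplies the suggestions.

```python
import difflib
from typing import Any, Callable, Dict


class UnknownToolError(Exception):
    """Raised when the model requests a tool that is not registered."""


def resolve_tool(name: str, registry: Dict[str, Callable[..., Any]]) -> Callable[..., Any]:
    """Return the registered tool, or raise with a hint for a retry prompt."""
    if name in registry:
        return registry[name]
    # Suggest close matches so the error message can steer the next attempt
    suggestions = difflib.get_close_matches(name, list(registry), n=3, cutoff=0.6)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
    raise UnknownToolError(
        f"Unknown tool '{name}'.{hint} Available tools: {sorted(registry)}"
    )


# Placeholder tools for illustration
registry = {
    "calculator": lambda expression: expression,
    "web_search": lambda query: [query],
}

try:
    resolve_tool("calculater", registry)
    message = ""
except UnknownToolError as exc:
    message = str(exc)
print(message)
```

Listing the available tools in the error keeps the retry prompt self-contained, which matters when the original tool descriptions have scrolled out of the model's context.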

Challenge: Parameter Hallucination

Problem: LLM provides plausible but incorrect parameters

Solution: Strict schema validation, retry with error feedback

Challenge: Cost Optimization

Problem: Function calling increases API costs

Solution:

Cache tool results when possible
Use cheaper models for tool selection
Batch similar tool calls
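
As an example of the caching idea, the sketch below keys results by tool name and arguments and expires them after a time-to-live. `ToolResultCache` and `get_stock_price` are hypothetical, and the returned price is the sample value from the chaining example, not live data.

```python
import json
import time
from typing import Any, Callable, Dict, Tuple


class ToolResultCache:
    """Cache tool results keyed by tool name and arguments, with a time-to-live."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[Tuple[str, str], Tuple[float, Any]] = {}

    def get_or_call(self, tool_name: str, fn: Callable[..., Any], **kwargs: Any) -> Any:
        # Sorted keys make the cache key independent of argument order
        key = (tool_name, json.dumps(kwargs, sort_keys=True, default=str))
        now = self.clock()
        cached = self._store.get(key)
        if cached is not None and now - cached[0] < self.ttl:
            return cached[1]
        result = fn(**kwargs)
        self._store[key] = (now, result)
        return result


calls = {"count": 0}


def get_stock_price(symbol: str) -> float:
    """Hypothetical tool; counts invocations so cache hits are visible."""
    calls["count"] += 1
    return 178.50


cache = ToolResultCache(ttl_seconds=30)
first = cache.get_or_call("stock_price", get_stock_price, symbol="AAPL")
second = cache.get_or_call("stock_price", get_stock_price, symbol="AAPL")
print(first, second, calls["count"])
```

Caching suits read-only tools such as lookups and searches; action tools like the email sender should bypass it, since replaying a cached result would silently skip the action.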

Challenge: Security

Problem: Malicious prompts could abuse tools

Solution:

Whitelist allowed tools per user
Validate all parameters
Rate limiting and monitoring
Sandbox execution environments
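
Per-user rate limiting can be sketched with a sliding window of call timestamps. `ToolRateLimiter` is a hypothetical name and the limits are arbitrary; the injectable clock exists only to make the example deterministic.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class ToolRateLimiter:
    """Sliding-window limit on tool calls per user."""

    def __init__(self, max_calls: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_calls = max_calls
        self.window = window_seconds
        self.clock = clock
        self._calls: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        calls = self._calls[user_id]
        # Drop timestamps that have fallen out of the window
        while calls and now - calls[0] >= self.window:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)
        return True


# A fake clock keeps the example deterministic
fake_now = [0.0]
limiter = ToolRateLimiter(max_calls=2, window_seconds=10, clock=lambda: fake_now[0])
decisions = [limiter.allow("alice") for _ in range(3)]
fake_now[0] = 10.0
after_window = limiter.allow("alice")
print(decisions, after_window)  # [True, True, False] True
```

Checking `allow` before every tool execution also bounds the damage from a prompt that tricks the model into calling tools in a loop.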

Impact & Applications

Customer Support Bot

Tools: Knowledge base search, order lookup, refund processing
Results: 60% query resolution without human
Customer Satisfaction: 4.3/5

Data Analysis Assistant

Tools: SQL queries, data visualization, statistical tests
Results: 10x faster exploratory analysis
Adoption: Used by 50+ analysts

Personal AI Assistant

Tools: Calendar, email, reminders, web search
Results: Saves 2 hours/day on average
User Feedback: “Like having a personal assistant”

Future Research

Automatic Tool Generation: LLM creates new tools as needed
Tool Discovery: Agent explores and learns about new tools
Multi-Modal Tools: Tools handling images, audio, video
Collaborative Tools: Multiple agents using shared tools
Tool Marketplaces: Ecosystem of third-party tools
Self-Improving Tools: Tools that optimize themselves based on usage

Open Source Contributions

Repositories

tool-calling-framework: Comprehensive toolkit
tool-library: 50+ pre-built tools
evaluation-suite: Benchmark for tool calling systems

Documentation

Comprehensive API reference
Tutorials for building custom tools
Best practices guide
Security considerations

Resources

Code Repository: [github.com/umberH/tool-calling-experiment]
Live Demo: [Interactive tool calling playground]

