gpt-researcher

GPT Researcher is an autonomous deep research agent that conducts web and local research, producing detailed reports with citations. Use this skill when helping developers understand, extend, debug, or integrate with GPT Researcher - including adding features, understanding the architecture, working with the API, customizing research workflows, adding new retrievers, integrating MCP data sources, or troubleshooting research pipelines.

assafelovic/gpt-researcher

SKILL.md
# GPT Researcher Development Skill

GPT Researcher is an LLM-based autonomous agent using a planner-executor-publisher pattern with parallelized agent work for speed and reliability.

## Quick Start

### Basic Python Usage

```python
from gpt_researcher import GPTResearcher
import asyncio

async def main():
    researcher = GPTResearcher(
        query="What are the latest AI developments?",
        report_type="research_report",  # or detailed_report, deep, outline_report
        report_source="web",            # or local, hybrid
    )
    await researcher.conduct_research()
    report = await researcher.write_report()
    print(report)

asyncio.run(main())
```

### Run Servers

```bash
# Backend (the FastAPI app lives in backend/server/app.py)
python -m uvicorn backend.server.app:app --reload --port 8000

# Frontend
cd frontend/nextjs && npm install && npm run dev
```

---

## Key File Locations

| Need | Primary File | Key Classes |
|------|--------------|-------------|
| Main orchestrator | `gpt_researcher/agent.py` | `GPTResearcher` |
| Research logic | `gpt_researcher/skills/researcher.py` | `ResearchConductor` |
| Report writing | `gpt_researcher/skills/writer.py` | `ReportGenerator` |
| All prompts | `gpt_researcher/prompts.py` | `PromptFamily` |
| Configuration | `gpt_researcher/config/config.py` | `Config` |
| Config defaults | `gpt_researcher/config/variables/default.py` | `DEFAULT_CONFIG` |
| API server | `backend/server/app.py` | FastAPI `app` |
| Search engines | `gpt_researcher/retrievers/` | Various retrievers |

---

## Architecture Overview

```
User Query → GPTResearcher.__init__()
                │
                ▼
         choose_agent() → (agent_type, role_prompt)
                │
                ▼
         ResearchConductor.conduct_research()
           ├── plan_research() → sub_queries
           ├── For each sub_query:
           │     └── _process_sub_query() → context
           └── Aggregate contexts
                │
                ▼
         [Optional] ImageGenerator.plan_and_generate_images()
                │
                ▼
         ReportGenerator.write_report() → Markdown report
```
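
The flow above can be sketched as a small asyncio program. This is a minimal illustration under stated assumptions, not the project's code: `plan_research` and `process_sub_query` here are local stand-ins that return canned strings, where the real methods call an LLM, retrievers, and scrapers.

```python
import asyncio


async def plan_research(query: str) -> list[str]:
    # Stand-in for the LLM planner: split the query into sub-queries.
    return [f"{query} (background)", f"{query} (recent work)", f"{query} (open problems)"]


async def process_sub_query(sub_query: str) -> str:
    # Stand-in for search + scrape + summarize on one sub-query.
    await asyncio.sleep(0)
    return f"context for: {sub_query}"


async def conduct_research(query: str) -> str:
    sub_queries = await plan_research(query)
    # Run every sub-query concurrently; gather preserves input order.
    contexts = await asyncio.gather(*(process_sub_query(q) for q in sub_queries))
    # Aggregate the per-sub-query contexts into one research context.
    return "\n".join(contexts)


context = asyncio.run(conduct_research("quantum computing"))
print(context)
```

Running sub-queries with `asyncio.gather` is one plausible way to get the parallelism described above; a failure in one sub-query would need separate handling (for example `return_exceptions=True`).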

**For detailed architecture diagrams**: See [references/architecture.md](references/architecture.md)

---

## Core Patterns

### Adding a New Feature (8-Step Pattern)

1. **Config** → Add to `gpt_researcher/config/variables/default.py`
2. **Provider** → Create in `gpt_researcher/llm_provider/my_feature/`
3. **Skill** → Create in `gpt_researcher/skills/my_feature.py`
4. **Agent** → Integrate in `gpt_researcher/agent.py`
5. **Prompts** → Update `gpt_researcher/prompts.py`
6. **WebSocket** → Events via `stream_output()`
7. **Frontend** → Handle events in `useWebSocket.ts`
8. **Docs** → Create `docs/docs/gpt-researcher/gptr/my_feature.md`

**For complete feature addition guide with Image Generation case study**: See [references/adding-features.md](references/adding-features.md)

### Adding a New Retriever

```python
# 1. Create: gpt_researcher/retrievers/my_retriever/my_retriever.py
class MyRetriever:
    def __init__(self, query: str, headers: dict | None = None):
        self.query = query
        self.headers = headers or {}

    async def search(self, max_results: int = 10) -> list[dict]:
        # Return: [{"title": str, "href": str, "body": str}]
        pass

# 2. Register in gpt_researcher/actions/retriever.py by adding a case
#    to the existing match statement:
#
#    case "my_retriever":
#        from gpt_researcher.retrievers.my_retriever import MyRetriever
#        return MyRetriever

# 3. Export in gpt_researcher/retrievers/__init__.py
```
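
To make the expected result shape concrete, here is a minimal sketch of a retriever that follows the interface above. `InMemoryRetriever` and its fixed page list are hypothetical and exist only for illustration; a real retriever would call a search API inside `search()`.

```python
import asyncio
from typing import Optional


class InMemoryRetriever:
    """Hypothetical retriever that searches a fixed list instead of the web."""

    # Illustrative corpus, not real data.
    _PAGES = [
        {"title": "Intro to agents", "href": "https://example.com/agents",
         "body": "Agents plan and act."},
        {"title": "Search basics", "href": "https://example.com/search",
         "body": "Retrievers return ranked results."},
    ]

    def __init__(self, query: str, headers: Optional[dict] = None):
        self.query = query
        self.headers = headers or {}

    async def search(self, max_results: int = 10) -> list[dict]:
        terms = self.query.lower().split()
        # Keep pages whose title or body contains any query term.
        hits = [
            page for page in self._PAGES
            if any(term in f"{page['title']} {page['body']}".lower() for term in terms)
        ]
        return hits[:max_results]


results = asyncio.run(InMemoryRetriever("agents").search(max_results=5))
print(results)
```

Each result carries exactly the `title`, `href`, and `body` keys that the rest of the pipeline reads.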

**For complete retriever documentation**: See [references/retrievers.md](references/retrievers.md)

---

## Configuration

Config keys are **lowercased** when accessed:

```python
# In default.py: "SMART_LLM": "gpt-4o"
# Access as: self.cfg.smart_llm  # lowercase!
```

Priority (highest to lowest): Environment Variables → JSON Config File → Default Values
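
The lookup order and the lowercased attribute access can be illustrated with a small stand-alone sketch. `LayeredConfig` is a hypothetical class written for this example, not the project's `Config`; the `MAX_ITERATIONS` key and its value are illustrative only.

```python
import json
import os
from typing import Any, Mapping, Optional

# "SMART_LLM" mirrors the example above; "MAX_ITERATIONS" is illustrative.
DEFAULTS = {"SMART_LLM": "gpt-4o", "MAX_ITERATIONS": 3}


class LayeredConfig:
    """Resolve each key from env, then a JSON file, then defaults."""

    def __init__(
        self,
        defaults: Mapping[str, Any],
        json_path: Optional[str] = None,
        env: Optional[Mapping[str, str]] = None,
    ):
        env = os.environ if env is None else env
        file_values: dict = {}
        if json_path:
            with open(json_path, encoding="utf-8") as handle:
                file_values = json.load(handle)
        for key, default in defaults.items():
            if key in env:
                value = env[key]          # highest priority (always a string)
            elif key in file_values:
                value = file_values[key]  # medium priority
            else:
                value = default           # lowest priority
            # Keys are stored lowercased, so SMART_LLM becomes .smart_llm
            setattr(self, key.lower(), value)


cfg = LayeredConfig(DEFAULTS, env={"SMART_LLM": "gpt-4o-mini"})
print(cfg.smart_llm)       # taken from the simulated environment
print(cfg.max_iterations)  # falls back to the default
```

Note that values read from the environment arrive as strings, so a real loader also has to convert them to the declared type.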

**For complete configuration reference**: See [references/config-reference.md](references/config-reference.md)

---

## Common Integration Points

### WebSocket Streaming

```python
class WebSocketHandler:
    async def send_json(self, data):
        print(f"[{data['type']}] {data.get('output', '')}")

researcher = GPTResearcher(query="...", websocket=WebSocketHandler())
```

### MCP Data Sources

```python
import os

researcher = GPTResearcher(
    query="Open source AI projects",
    mcp_configs=[{
        "name": "github",
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-github"],
        "env": {"GITHUB_TOKEN": os.getenv("GITHUB_TOKEN")}
    }],
    mcp_strategy="deep",  # or "fast", "disabled"
)
```

**For MCP integration details**: See [references/mcp.md](references/mcp.md)

### Deep Research Mode

```python
researcher = GPTResearcher(
    query="Comprehensive analysis of quantum computing",
    report_type="deep",  # Triggers recursive tree-like exploration
)
```
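
A recursive, tree-like exploration can be pictured with the sketch below. It is an illustration of the general idea only: the `depth` and `breadth` parameters and the `expand` helper are assumptions made for this example, and the real skill replaces `expand` with LLM-generated follow-up queries plus actual research at each node.

```python
import asyncio


async def expand(topic: str, breadth: int) -> list[str]:
    # Stand-in for an LLM call that proposes follow-up questions.
    return [f"{topic} / follow-up {i + 1}" for i in range(breadth)]


async def explore(topic: str, depth: int, breadth: int) -> list[str]:
    """Return every topic visited, starting with the root."""
    visited = [topic]
    if depth == 0:
        return visited
    children = await expand(topic, breadth)
    # Explore sibling branches concurrently, each one level shallower.
    branches = await asyncio.gather(
        *(explore(child, depth - 1, breadth) for child in children)
    )
    for branch in branches:
        visited.extend(branch)
    return visited


topics = asyncio.run(explore("quantum computing", depth=2, breadth=2))
print(len(topics))
```

With a branching factor of 2 and two levels below the root, the tree has 1 + 2 + 4 nodes, which shows how quickly cost grows with depth and breadth.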

**For deep research configuration**: See [references/deep-research.md](references/deep-research.md)

---

## Error Handling

Skills should degrade gracefully: return an empty result instead of raising when a feature is disabled or its provider fails.

```python
async def execute(self, ...):
    if not self.is_enabled():
        return []  # Don't crash
    
    try:
        result = await self.provider.execute(...)
        return result
    except Exception as e:
        await stream_output("logs", "error", f"⚠️ {e}", self.websocket)
        return []  # Graceful degradation
```

---

## Critical Gotchas

| ❌ Mistake | ✅ Correct |
|-----------|-----------|
| `config.MY_VAR` | `config.my_var` (lowercased) |
| Editing pip-installed package | Install from source in editable mode: `pip install -e .` |
| Forgetting async/await | All research methods are async |
| `websocket.send_json()` on None | Check `if websocket:` first |
| Not registering retriever | Add to `retriever.py` match statement |

---

## Reference Documentation

| Topic | File |
|-------|------|
| System architecture & diagrams | [references/architecture.md](references/architecture.md) |
| Core components & signatures | [references/components.md](references/components.md) |
| Research flow & data flow | [references/flows.md](references/flows.md) |
| Prompt system | [references/prompts.md](references/prompts.md) |
| Retriever system | [references/retrievers.md](references/retrievers.md) |
| MCP integration | [references/mcp.md](references/mcp.md) |
| Deep research mode | [references/deep-research.md](references/deep-research.md) |
| Multi-agent system | [references/multi-agents.md](references/multi-agents.md) |
| Adding features guide | [references/adding-features.md](references/adding-features.md) |
| Advanced patterns | [references/advanced-patterns.md](references/advanced-patterns.md) |
| REST & WebSocket API | [references/api-reference.md](references/api-reference.md) |
| Configuration variables | [references/config-reference.md](references/config-reference.md) |

References (12)

📎 adding-features.md
# Adding Features Guide

## Table of Contents
- [The 8-Step Pattern](#the-8-step-pattern)
- [Image Generation Case Study](#image-generation-case-study)
- [Testing New Features](#testing-new-features)

---

## The 8-Step Pattern

```
┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│ 1. CONFIG    │ → │ 2. PROVIDER  │ → │ 3. SKILL     │ → │ 4. AGENT     │
└──────────────┘   └──────────────┘   └──────────────┘   └──────────────┘
       ↓                  ↓                  ↓                  ↓
┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│ 5. PROMPTS   │ → │ 6. WEBSOCKET │ → │ 7. FRONTEND  │ → │ 8. DOCS      │
└──────────────┘   └──────────────┘   └──────────────┘   └──────────────┘
```

### Step 1: Add Configuration

**File:** `gpt_researcher/config/variables/default.py`

```python
DEFAULT_CONFIG: BaseConfig = {
    "MY_FEATURE_ENABLED": False,
    "MY_FEATURE_MODEL": "model-name",
    "MY_FEATURE_MAX_ITEMS": 3,
}
```

**File:** `gpt_researcher/config/variables/base.py`

```python
from typing import TypedDict, Union

class BaseConfig(TypedDict):
    MY_FEATURE_ENABLED: bool
    MY_FEATURE_MODEL: Union[str, None]
    MY_FEATURE_MAX_ITEMS: int
```

### Step 2: Create Provider

**File:** `gpt_researcher/llm_provider/my_feature/my_provider.py`

```python
import os
from typing import Any, Dict

class MyFeatureProvider:
    def __init__(self, api_key: str = None, model: str = None):
        self.api_key = api_key or os.getenv("MY_API_KEY")
        self.model = model
    
    def is_enabled(self) -> bool:
        return bool(self.api_key and self.model)
    
    async def execute(self, input_data: str) -> Dict[str, Any]:
        # API implementation
        pass
```

Export in `gpt_researcher/llm_provider/__init__.py`.

### Step 3: Create Skill

**File:** `gpt_researcher/skills/my_feature.py`

```python
class MyFeatureSkill:
    def __init__(self, researcher):
        self.researcher = researcher
        self.config = researcher.cfg
        self.provider = MyFeatureProvider(...)
    
    def is_enabled(self) -> bool:
        return getattr(self.config, 'my_feature_enabled', False) and self.provider.is_enabled()
    
    async def execute(self, context: str, query: str) -> List[Dict]:
        if not self.is_enabled():
            return []
        
        await stream_output("logs", "my_feature_start", "🚀 Starting...", self.researcher.websocket)
        results = await self.provider.execute(context)
        await stream_output("logs", "my_feature_complete", "✅ Done", self.researcher.websocket)
        
        return results
```

Export in `gpt_researcher/skills/__init__.py`.
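
The provider and skill templates above fit together as shown in this runnable sketch. Everything in it is hypothetical: `EchoProvider` and `EchoSkill` are toy classes, and the local `stream_output` is a stand-in whose signature is assumed from the calls shown in this guide, not imported from the project.

```python
import asyncio
from typing import Any, Dict, List, Optional


async def stream_output(kind: str, content: str, output: str, websocket: Any = None) -> None:
    # Local stand-in for the project's stream_output helper (signature assumed).
    if websocket is not None:
        await websocket.send_json({"type": kind, "content": content, "output": output})


class EchoProvider:
    """Hypothetical provider: 'calls' an API by upper-casing its input."""

    def __init__(self, api_key: Optional[str] = None, model: Optional[str] = None):
        self.api_key = api_key
        self.model = model

    def is_enabled(self) -> bool:
        return bool(self.api_key and self.model)

    async def execute(self, input_data: str) -> List[Dict[str, Any]]:
        return [{"summary": input_data.upper()}]


class EchoSkill:
    def __init__(self, provider: EchoProvider, enabled: bool, websocket: Any = None):
        self.provider = provider
        self.enabled = enabled
        self.websocket = websocket

    def is_enabled(self) -> bool:
        # Both the config flag and the provider must be ready.
        return self.enabled and self.provider.is_enabled()

    async def execute(self, context: str) -> List[Dict[str, Any]]:
        if not self.is_enabled():
            return []  # Disabled or unconfigured: skip quietly
        try:
            await stream_output("logs", "echo_start", "Starting...", self.websocket)
            return await self.provider.execute(context)
        except Exception as exc:
            await stream_output("logs", "echo_error", f"Error: {exc}", self.websocket)
            return []  # Provider failure: degrade to an empty result


on = EchoSkill(EchoProvider(api_key="test-key", model="demo"), enabled=True)
off = EchoSkill(EchoProvider(api_key=None, model="demo"), enabled=True)
print(asyncio.run(on.execute("hello")))
print(asyncio.run(off.execute("hello")))
```

The second skill has its flag on but no API key, so it returns an empty list rather than raising, which is the behavior the agent integration in the next step relies on.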

### Step 4: Integrate into Agent

**File:** `gpt_researcher/agent.py`

```python
def __init__(self, ...):
    if self.cfg.my_feature_enabled:
        from gpt_researcher.skills import MyFeatureSkill
        self.my_feature = MyFeatureSkill(self)
    else:
        self.my_feature = None
    self.my_feature_results = []

async def conduct_research(self, ...):
    # ... existing ...
    if self.my_feature and self.my_feature.is_enabled():
        self.my_feature_results = await self.my_feature.execute(self.context, self.query)
```

### Step 5: Update Prompts

**File:** `gpt_researcher/prompts.py`

```python
@staticmethod
def generate_my_feature_prompt(context: str, query: str) -> str:
    return f"""..."""
```

### Step 6: WebSocket Events

No extra work is needed: the skill from Step 3 already emits its events through `stream_output()`.

### Step 7: Frontend (if needed)

**File:** `frontend/nextjs/hooks/useWebSocket.ts`

```typescript
if (data.content === 'my_feature_start') {
    setStatus('processing');
}
```

### Step 8: Documentation

Create `docs/docs/gpt-researcher/gptr/my_feature.md`.

---

## Image Generation Case Study

This section walks through the Image Generation feature as a reference. The excerpts are abridged; elided code is marked with `...`.

### 1. Configuration Added

**File:** `gpt_researcher/config/variables/default.py`

```python
DEFAULT_CONFIG: BaseConfig = {
    # ... existing ...
    "IMAGE_GENERATION_MODEL": "models/gemini-2.5-flash-image",
    "IMAGE_GENERATION_MAX_IMAGES": 3,
    "IMAGE_GENERATION_ENABLED": False,
    "IMAGE_GENERATION_STYLE": "dark",  # dark, light, auto
}
```

### 2. Provider Created

**File:** `gpt_researcher/llm_provider/image/image_generator.py`

```python
class ImageGeneratorProvider:
    def __init__(self, api_key: str = None, model: str = None):
        self.api_key = api_key or os.getenv("GOOGLE_API_KEY")
        self.model = model or "models/gemini-2.5-flash-image"
        self._client = None
    
    def is_enabled(self) -> bool:
        return bool(self.api_key and self.model)
    
    def _build_enhanced_prompt(self, prompt: str, context: str = "", style: str = "dark") -> str:
        """Add styling instructions to prompt."""
        if style == "dark":
            style_instructions = """
            Style: Dark mode professional infographic
            - Background: Dark (#0d1117)
            - Accents: Teal/cyan (#14b8a6)
            - Clean, modern, minimalist
            """
        # ... handle light, auto
        return f"{style_instructions}\n\nCreate: {prompt}\n\nContext: {context}"
    
    async def generate_image(
        self,
        prompt: str,
        context: str = "",
        research_id: str = "",
        style: str = "dark",
    ) -> List[Dict[str, Any]]:
        """Generate image using Gemini."""
        full_prompt = self._build_enhanced_prompt(prompt, context, style)
        
        # Call Gemini API
        response = await self._generate_with_gemini(full_prompt, output_path, ...)
        
        return [{"url": f"/outputs/images/{research_id}/img_{hash}.png", ...}]
```

### 3. Skill Created

**File:** `gpt_researcher/skills/image_generator.py`

```python
class ImageGenerator:
    def __init__(self, researcher):
        self.researcher = researcher
        self.config = researcher.cfg
        self.image_provider = ImageGeneratorProvider(
            api_key=os.getenv("GOOGLE_API_KEY"),
            model=getattr(self.config, 'image_generation_model', None),
        )
        self.max_images = getattr(self.config, 'image_generation_max_images', 3)
        self.style = getattr(self.config, 'image_generation_style', 'dark')
    
    def is_enabled(self) -> bool:
        enabled = getattr(self.config, 'image_generation_enabled', False)
        return enabled and self.image_provider.is_enabled()
    
    async def plan_and_generate_images(
        self,
        research_context: str,
        research_query: str,
        research_id: str,
        websocket: Any,
    ) -> List[Dict[str, Any]]:
        """
        1. Use LLM to identify visual concepts from context
        2. Generate up to max_images images, one per concept
        3. Return list of image metadata
        """
        # Stream progress
        await stream_output("logs", "image_planning", "🎨 Planning images...", websocket)
        
        # LLM identifies concepts
        concepts = await self._plan_image_concepts(research_context, research_query)
        
        # Generate one image per concept, capped at max_images
        # (each call is awaited in turn in this excerpt)
        selected = concepts[:self.max_images]
        generated_images = []
        for i, concept in enumerate(selected):
            await stream_output("logs", "image_generating", 
                f"🖼️ Generating image {i+1}/{len(selected)}...", websocket)
            
            images = await self.image_provider.generate_image(
                prompt=concept["prompt"],
                context=concept.get("context", ""),
                research_id=research_id,
                style=self.style,
            )
            generated_images.extend(images)
        
        await stream_output("logs", "images_ready", 
            f"✅ Generated {len(generated_images)} images", websocket)
        
        return generated_images
```
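
The loop in the excerpt above awaits each image in turn. If the image calls are independent, they can instead be issued concurrently with `asyncio.gather`. The sketch below shows that variant under stated assumptions: `fake_generate_image` is a hypothetical stand-in for the provider call, and the URL layout is illustrative.

```python
import asyncio
from typing import Any, Dict, List


async def fake_generate_image(prompt: str, index: int) -> List[Dict[str, Any]]:
    # Stand-in for ImageGeneratorProvider.generate_image (no API call).
    await asyncio.sleep(0)
    return [{"title": prompt, "url": f"/outputs/images/demo/img_{index}.png"}]


async def generate_all(concepts: List[Dict[str, str]], max_images: int) -> List[Dict[str, Any]]:
    selected = concepts[:max_images]
    # Start every generation at once; results come back in input order.
    batches = await asyncio.gather(
        *(fake_generate_image(c["prompt"], i) for i, c in enumerate(selected)),
        return_exceptions=True,
    )
    images: List[Dict[str, Any]] = []
    for batch in batches:
        if isinstance(batch, Exception):
            continue  # One failed image should not discard the others
        images.extend(batch)
    return images


concepts = [{"prompt": f"concept {n}"} for n in range(4)]
images = asyncio.run(generate_all(concepts, max_images=3))
print([image["url"] for image in images])
```

Concurrent calls finish sooner but can hit provider rate limits, so a semaphore or a small batch size may be needed in practice.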

### 4. Agent Integration

**File:** `gpt_researcher/agent.py`

```python
class GPTResearcher:
    def __init__(self, ...):
        # ... existing init ...
        
        # Initialize image generator if enabled
        if self.cfg.image_generation_enabled:
            from gpt_researcher.skills import ImageGenerator
            self.image_generator = ImageGenerator(self)
        else:
            self.image_generator = None
        
        self.available_images: List[Dict[str, Any]] = []
        self.research_id = self._generate_research_id(query)
    
    async def conduct_research(self, on_progress=None):
        # ... existing research ...
        
        self.context = await self.research_conductor.conduct_research()
        
        # Pre-generate images after research, before report writing
        if self.cfg.image_generation_enabled and self.image_generator and self.image_generator.is_enabled():
            self.available_images = await self.image_generator.plan_and_generate_images(
                research_context=self.context,
                research_query=self.query,
                research_id=self.research_id,
                websocket=self.websocket,
            )
        
        return self.context
    
    async def write_report(self, ...):
        report = await self.report_generator.write_report(
            # ... existing params ...
            available_images=self.available_images,  # Pass to report writer
        )
        return report
```

### 5. Prompt Updated

**File:** `gpt_researcher/prompts.py`

```python
@staticmethod
def generate_report_prompt(..., available_images: Optional[List[Dict[str, Any]]] = None):
    # None instead of [] avoids sharing one mutable default list across calls
    image_instruction = ""
    if available_images:
        image_list = "\n".join([
            f"- Title: {img.get('title', 'Untitled')}\n  URL: {img['url']}"
            for img in available_images
        ])
        image_instruction = f"""
AVAILABLE IMAGES - Embed where relevant using ![Title](URL):
{image_list}
"""
    
    return f"""...(existing prompt)...
{image_instruction}
"""
```

---

## Testing New Features

```python
# tests/test_my_feature.py
import pytest
from gpt_researcher import GPTResearcher

@pytest.mark.asyncio
async def test_my_feature_disabled():
    """Test that feature is skipped when disabled."""
    researcher = GPTResearcher(query="test")
    # MY_FEATURE_ENABLED defaults to False
    assert researcher.my_feature is None

@pytest.mark.asyncio
async def test_my_feature_enabled(monkeypatch):
    """Test feature execution when enabled."""
    monkeypatch.setenv("MY_FEATURE_ENABLED", "true")
    monkeypatch.setenv("MY_API_KEY", "test-key")
    
    researcher = GPTResearcher(query="test")
    assert researcher.my_feature is not None
    assert researcher.my_feature.is_enabled()
```

### Running Tests

```bash
# All tests
python -m pytest tests/

# Specific test
python -m pytest tests/test_my_feature.py -v

# With coverage
python -m pytest tests/ --cov=gpt_researcher
```
📎 advanced-patterns.md
# Advanced Patterns Reference

## Table of Contents
- [Custom Callbacks](#custom-callbacks)
- [Custom WebSocket Handler](#custom-websocket-handler)
- [LangChain Integration](#langchain-integration)
- [Search Restrictions](#search-restrictions)
- [Error Handling Patterns](#error-handling-patterns)

---

## Custom Callbacks

```python
def cost_callback(cost: float):
    print(f"API call cost: ${cost}")

researcher = GPTResearcher(query="...")
researcher.add_costs = cost_callback  # Override cost tracking
```

---

## Custom WebSocket Handler

```python
class CustomWebSocket:
    def __init__(self):
        self.messages = []
    
    async def send_json(self, data):
        self.messages.append(data)
        if data.get('type') == 'logs':
            print(f"Progress: {data.get('output', '')}")

researcher = GPTResearcher(query="...", websocket=CustomWebSocket())
```

---

## LangChain Integration

### Using with LangChain Documents

```python
from langchain_community.document_loaders import DirectoryLoader  # formerly langchain.document_loaders

loader = DirectoryLoader('./docs', glob="**/*.md")
documents = loader.load()

researcher = GPTResearcher(
    query="Summarize the documentation",
    report_source="langchain_documents",
    documents=documents,
)
```

### Using with Vector Store

```python
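from langchain_community.vectorstores import Chroma  # formerly langchain.vectorstores

# `documents` comes from a loader; `embeddings` is any LangChain embeddings instance
vectorstore = Chroma.from_documents(documents, embeddings)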

researcher = GPTResearcher(
    query="Find relevant information",
    report_source="langchain_vectorstore",
    vector_store=vectorstore,
    vector_store_filter={"source": "docs"},
)
```

---

## Search Restrictions

### Restricting Search Domains

```python
researcher = GPTResearcher(
    query="Company news",
    query_domains=["reuters.com", "bloomberg.com", "wsj.com"],
)
```

### Using Specific Source URLs

```python
researcher = GPTResearcher(
    query="Analyze these articles",
    source_urls=[
        "https://example.com/article1",
        "https://example.com/article2",
    ],
    complement_source_urls=True,  # Also do web search
)
```

---

## Error Handling Patterns

### Graceful Degradation

```python
# In skills, always check is_enabled()
async def execute(self, ...):
    if not self.is_enabled():
        logger.warning("Feature not enabled, skipping")
        return []  # Return empty, don't crash
    
    try:
        result = await self.provider.execute(...)
        return result
    except Exception as e:
        logger.error(f"Feature error: {e}")
        await stream_output("logs", "feature_error", f"⚠️ Error: {e}", self.websocket)
        return []  # Graceful degradation
```

### API Rate Limiting

```python
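# Providers should handle rate limits.
# RateLimitError stands for the rate-limit exception raised by the provider's SDK.
async def execute(self, ...):
    try:
        return await self._call_api(...)
    except RateLimitError:
        logger.warning("Rate limited, waiting 60s before one retry...")
        await asyncio.sleep(60)
        return await self._call_api(...)  # Single retry; a second failure propagates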
```
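
A fixed 60-second wait with one retry is the simplest policy. A common alternative is bounded exponential backoff, sketched below. `RateLimitError` is defined locally as a hypothetical stand-in for the SDK's exception, and the helper name and default delays are assumptions for this example.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider SDK's rate-limit exception."""


async def call_with_backoff(
    call: Callable[[], Awaitable[T]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> T:
    for attempt in range(max_attempts):
        try:
            return await call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: let the caller degrade gracefully
            # Double the wait each time, capped at max_delay, plus small jitter.
            delay = min(max_delay, base_delay * 2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("max_attempts must be at least 1")


attempts = {"count": 0}


async def flaky() -> str:
    # Fails twice with a rate limit, then succeeds.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError()
    return "ok"


result = asyncio.run(call_with_backoff(flaky, base_delay=0.001))
print(result, attempts["count"])
```

The jitter spreads out retries from concurrent callers so they do not all hit the API again at the same instant.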

### WebSocket None Check

```python
# Always check websocket before sending
if self.researcher.websocket:
    await stream_output("logs", "event", "message", self.researcher.websocket)
```
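
The same check can be wrapped once in a helper so call sites cannot forget it. This is a sketch only: `safe_send` and `RecordingWebSocket` are hypothetical names, and the message keys follow the `type`/`content`/`output` shape used by the log events in this guide.

```python
import asyncio
from typing import Any, Optional


async def safe_send(websocket: Optional[Any], kind: str, content: str, output: str) -> bool:
    """Send one event if a websocket is attached; report whether it was sent."""
    if websocket is None:
        return False  # Running without a client (e.g. as a library): do nothing
    await websocket.send_json({"type": kind, "content": content, "output": output})
    return True


class RecordingWebSocket:
    """Test double that stores messages instead of sending them."""

    def __init__(self) -> None:
        self.messages: list = []

    async def send_json(self, data: dict) -> None:
        self.messages.append(data)


ws = RecordingWebSocket()
sent_without_client = asyncio.run(safe_send(None, "logs", "event", "message"))
sent_with_client = asyncio.run(safe_send(ws, "logs", "event", "message"))
print(sent_without_client, sent_with_client, ws.messages)
```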
📎 api-reference.md
# API Reference

## Table of Contents
- [REST API](#rest-api)
- [WebSocket API](#websocket-api)
- [Python Client](#python-client)
- [Output Files](#output-files)
- [Error Codes](#error-codes)

---

## REST API

Base URL: `http://localhost:8000`

### Generate Report

**POST `/report/`**

```json
{
    "task": "What are the latest AI developments?",
    "report_type": "research_report",
    "report_source": "web",
    "tone": "Objective",
    "source_urls": [],
    "query_domains": [],
    "generate_in_background": false
}
```

**Response:**

```json
{
    "report": "# Research Report\n\n...",
    "research_id": "task_1234567890_query",
    "costs": 0.05,
    "pdf_path": "outputs/task_123.pdf",
    "docx_path": "outputs/task_123.docx"
}
```
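
A request with this body can be built using only the Python standard library. The sketch below assumes the backend is running at the base URL given above; `build_report_request` is a hypothetical helper, and the actual send is left commented out so the example runs without a server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_report_request(
    task: str,
    report_type: str = "research_report",
    report_source: str = "web",
    tone: str = "Objective",
) -> urllib.request.Request:
    # Field names mirror the request body documented above.
    payload = {
        "task": task,
        "report_type": report_type,
        "report_source": report_source,
        "tone": tone,
        "source_urls": [],
        "query_domains": [],
        "generate_in_background": False,
    }
    return urllib.request.Request(
        f"{BASE_URL}/report/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_report_request("What are the latest AI developments?")
print(request.get_method(), request.full_url)

# To send it against a running backend:
# with urllib.request.urlopen(request) as response:
#     report = json.loads(response.read())["report"]
```

Research runs can take minutes, so a real client would also set a generous timeout on `urlopen`.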

### Chat with Report

**POST `/api/chat`**

```json
{
    "report": "The full report text...",
    "messages": [
        {"role": "user", "content": "What are the key findings?"}
    ]
}
```

### Report Management

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/reports` | List all reports |
| GET | `/api/reports/{id}` | Get single report |
| POST | `/api/reports` | Create/update report |
| PUT | `/api/reports/{id}` | Update report |
| DELETE | `/api/reports/{id}` | Delete report |

### File Operations

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/upload/` | Upload document |
| DELETE | `/delete/{filename}` | Delete file |
| GET | `/outputs/{filename}` | Get output file |

### Configuration

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/getConfig` | Get current config |
| POST | `/setConfig` | Update config |

---

## WebSocket API

**Endpoint:** `ws://localhost:8000/ws`

### Send Research Request

```json
{
    "task": "Research query",
    "report_type": "research_report",
    "report_source": "web",
    "tone": "Objective",
    "source_urls": [],
    "mcp_enabled": false,
    "mcp_strategy": "fast",
    "mcp_configs": []
}
```

### Message Types (Server → Client)

| Type | Content | Description |
|------|---------|-------------|
| `logs` | `starting_research` | Research initiated |
| `logs` | `planning_research` | Generating sub-queries |
| `logs` | `running_subquery_research` | Researching sub-query |
| `logs` | `research_step_finalized` | Research complete |
| `logs` | `agent_generated` | Agent role selected |
| `logs` | `scraping_urls` | Scraping web pages |
| `logs` | `mcp_optimization` | MCP processing |
| `logs` | `image_planning` | Planning images |
| `logs` | `images_ready` | Images generated |
| `report` | - | Streaming report chunks |
| `report_complete` | - | Final complete report |
| `path` | `pdf`, `docx`, `md` | Output file paths |
| `error` | - | Error messages |
| `human_feedback` | `request` | Request user input |

### Message Format

```json
{
    "type": "logs",
    "content": "starting_research",
    "output": "🔍 Starting the research task...",
    "metadata": null
}
```

### Frontend Handler Example

```typescript
ws.onmessage = (event) => {
    const data = JSON.parse(event.data);
    
    switch (data.type) {
        case 'logs':
            setLogs(prev => [...prev, data]);
            break;
        case 'report':
            setAnswer(prev => prev + data.output);
            break;
        case 'report_complete':
            setAnswer(data.output);
            break;
        case 'path':
            setPaths(prev => ({...prev, [data.content]: data.output}));
            break;
        case 'error':
            setError(data.output);
            break;
    }
};
```
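
A Python client can apply the same dispatch to incoming messages. The sketch below is transport-agnostic: it takes raw JSON strings, so it works with whichever WebSocket library delivers them. The `new_state` and `apply_message` names and the state layout are assumptions for this example.

```python
import json


def new_state() -> dict:
    return {"logs": [], "report": "", "paths": {}, "error": None}


def apply_message(state: dict, raw: str) -> dict:
    """Update client state from one server message, keyed on its type."""
    data = json.loads(raw)
    kind = data.get("type")
    if kind == "logs":
        state["logs"].append(data)
    elif kind == "report":
        state["report"] += data.get("output", "")   # streamed chunk
    elif kind == "report_complete":
        state["report"] = data.get("output", "")    # final text replaces chunks
    elif kind == "path":
        state["paths"][data.get("content")] = data.get("output")
    elif kind == "error":
        state["error"] = data.get("output")
    return state  # Unknown types are ignored


state = new_state()
for message in (
    {"type": "logs", "content": "starting_research", "output": "Starting..."},
    {"type": "report", "output": "# Title\n"},
    {"type": "report", "output": "Body"},
    {"type": "path", "content": "pdf", "output": "outputs/task_123.pdf"},
):
    apply_message(state, json.dumps(message))
print(state["report"])
```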

---

## Python Client

### Basic Usage

```python
from gpt_researcher import GPTResearcher
import asyncio

async def main():
    researcher = GPTResearcher(
        query="What are the latest AI developments?",
        report_type="research_report",
    )
    
    await researcher.conduct_research()
    report = await researcher.write_report()
    
    print(f"Report: {report}")
    print(f"Costs: ${researcher.get_costs()}")

asyncio.run(main())
```

### With MCP

```python
import os

researcher = GPTResearcher(
    query="Research topic",
    mcp_configs=[{
        "name": "github",
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-github"],
        "env": {"GITHUB_TOKEN": os.getenv("GITHUB_TOKEN")}
    }],
    mcp_strategy="deep",
)
```

### With WebSocket Streaming

```python
class MockWebSocket:
    async def send_json(self, data):
        print(f"[{data['type']}] {data.get('output', '')}")

researcher = GPTResearcher(
    query="Research topic",
    websocket=MockWebSocket(),
)
```

### GPTResearcher Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `query` | str | required | Research question |
| `report_type` | str | `research_report` | Type of report |
| `report_source` | str | `web` | Data source |
| `tone` | Tone | `Objective` | Writing tone |
| `source_urls` | list | `[]` | Specific URLs to research |
| `document_urls` | list | `[]` | Document URLs |
| `query_domains` | list | `[]` | Restrict to domains |
| `config_path` | str | None | Path to JSON config |
| `websocket` | WebSocket | None | For streaming |
| `mcp_configs` | list | `[]` | MCP server configs |
| `mcp_strategy` | str | `fast` | MCP strategy |
| `verbose` | bool | `True` | Verbose output |

---

## Output Files

```
outputs/
├── task_{timestamp}_{query}.md
├── task_{timestamp}_{query}.pdf
├── task_{timestamp}_{query}.docx
└── images/
    └── {research_id}/
        └── img_{hash}_{index}.png
```

---

## Error Codes

| Code | Description |
|------|-------------|
| 400 | Bad Request - Invalid parameters |
| 404 | Not Found - Report not found |
| 429 | Rate Limited - API quota exceeded |
| 500 | Internal Server Error |
📎 architecture.md
# Architecture Reference

## Table of Contents
- [System Layers](#system-layers)
- [Key File Locations](#key-file-locations)

---

## System Layers

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              USER REQUEST                                    │
│              (query, report_type, report_source, tone, mcp_configs)         │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         BACKEND API LAYER                                    │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐          │
│  │  FastAPI Server  │  │ WebSocket Manager│  │  Report Store    │          │
│  │  backend/server/ │  │ Real-time events │  │  JSON persistence│          │
│  │  app.py          │  │ websocket_mgr.py │  │  report_store.py │          │
│  └──────────────────┘  └──────────────────┘  └──────────────────┘          │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                    GPTResearcher (gpt_researcher/agent.py)                   │
│                                                                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                         SKILLS LAYER                                   │  │
│  │  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐         │  │
│  │  │ ResearchConductor│ │ ReportGenerator │ │ ContextManager  │         │  │
│  │  │ Plan & gather   │ │ Write reports   │ │ Similarity search│         │  │
│  │  │ researcher.py   │ │ writer.py       │ │ context_manager │         │  │
│  │  └─────────────────┘ └─────────────────┘ └─────────────────┘         │  │
│  │  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐         │  │
│  │  │ BrowserManager  │ │ SourceCurator   │ │ ImageGenerator  │         │  │
│  │  │ Web scraping    │ │ Rank sources    │ │ Gemini images   │         │  │
│  │  │ browser.py      │ │ curator.py      │ │ image_generator │         │  │
│  │  └─────────────────┘ └─────────────────┘ └─────────────────┘         │  │
│  │  ┌─────────────────┐                                                  │  │
│  │  │ DeepResearchSkill│                                                 │  │
│  │  │ Recursive depth │                                                  │  │
│  │  │ deep_research.py│                                                  │  │
│  │  └─────────────────┘                                                  │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                        ACTIONS LAYER                                   │  │
│  │  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐         │  │
│  │  │ report_generation│ │ query_processing│ │ web_scraping    │         │  │
│  │  │ LLM report write│ │ Sub-query plan  │ │ URL scraping    │         │  │
│  │  └─────────────────┘ └─────────────────┘ └─────────────────┘         │  │
│  │  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐         │  │
│  │  │ retriever.py    │ │ agent_creator   │ │ markdown_process│         │  │
│  │  │ Get retrievers  │ │ Choose agent    │ │ Parse markdown  │         │  │
│  │  └─────────────────┘ └─────────────────┘ └─────────────────┘         │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                       PROVIDERS LAYER                                  │  │
│  │  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐         │  │
│  │  │ LLM Provider    │ │ Retrievers      │ │ Scrapers        │         │  │
│  │  │ OpenAI,Anthropic│ │ Tavily,Google   │ │ BS4,Playwright  │         │  │
│  │  │ Google,Groq...  │ │ Bing,MCP...     │ │ PDF,DOCX...     │         │  │
│  │  │ llm_provider/   │ │ retrievers/     │ │ scraper/        │         │  │
│  │  └─────────────────┘ └─────────────────┘ └─────────────────┘         │  │
│  │  ┌─────────────────┐                                                  │  │
│  │  │ ImageGenerator  │                                                  │  │
│  │  │ Gemini/Imagen   │                                                  │  │
│  │  │ llm_provider/   │                                                  │  │
│  │  │ image/          │                                                  │  │
│  │  └─────────────────┘                                                  │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                        CONFIGURATION LAYER                                   │
│                     gpt_researcher/config/                                   │
│                                                                              │
│     Environment Variables  →  JSON Config File  →  Default Values            │
│           (highest)              (medium)            (lowest)                │
│                                                                              │
│     config.py loads and merges all sources                                   │
│     variables/default.py contains all defaults                               │
│     variables/base.py defines TypedDict for type safety                      │
└─────────────────────────────────────────────────────────────────────────────┘
```

---

## Key File Locations

| Need | Primary File | Key Classes/Functions |
|------|--------------|----------------------|
| Main orchestrator | `gpt_researcher/agent.py` | `GPTResearcher` |
| Research logic | `gpt_researcher/skills/researcher.py` | `ResearchConductor` |
| Report writing | `gpt_researcher/skills/writer.py` | `ReportGenerator` |
| Context/embeddings | `gpt_researcher/skills/context_manager.py` | `ContextManager` |
| Source ranking | `gpt_researcher/skills/curator.py` | `SourceCurator` |
| Deep research | `gpt_researcher/skills/deep_research.py` | `DeepResearchSkill` |
| Image generation | `gpt_researcher/skills/image_generator.py` | `ImageGenerator` |
| All prompts | `gpt_researcher/prompts.py` | `PromptFamily` |
| Configuration | `gpt_researcher/config/config.py` | `Config` |
| Config defaults | `gpt_researcher/config/variables/default.py` | `DEFAULT_CONFIG` |
| Config types | `gpt_researcher/config/variables/base.py` | `BaseConfig` |
| API server | `backend/server/app.py` | FastAPI `app` |
| WebSocket mgmt | `backend/server/websocket_manager.py` | `WebSocketManager`, `run_agent` |
| Report types | `backend/report_type/` | `BasicReport`, `DetailedReport` |
| Search engines | `gpt_researcher/retrievers/` | `TavilySearch`, `GoogleSearch`, etc. |
| Web scraping | `gpt_researcher/scraper/` | Various scrapers |
| Enums | `gpt_researcher/utils/enum.py` | `ReportType`, `ReportSource`, `Tone` |
📎 components.md
# Core Components & Method Signatures

## Table of Contents
- [GPTResearcher](#gptresearcher)
- [ResearchConductor](#researchconductor)
- [ReportGenerator](#reportgenerator)

---

## GPTResearcher

**File:** `gpt_researcher/agent.py`

The main orchestrator class. Full initialization signature:

```python
class GPTResearcher:
    def __init__(
        self,
        query: str,                              # Research question (required)
        report_type: str = "research_report",    # research_report, detailed_report, deep, outline_report, resource_report
        report_format: str = "markdown",         # Output format
        report_source: str = "web",              # web, local, hybrid, azure, langchain_documents, langchain_vectorstore
        tone: Tone = Tone.Objective,             # Writing tone (see Tone enum)
        source_urls: list[str] | None = None,    # Specific URLs to research
        document_urls: list[str] | None = None,  # Document URLs to include
        complement_source_urls: bool = False,    # Add web search to source_urls
        query_domains: list[str] | None = None,  # Restrict search to domains
        documents=None,                          # LangChain document objects
        vector_store=None,                       # LangChain vector store
        vector_store_filter=None,                # Filter for vector store
        config_path=None,                        # Path to JSON config file
        websocket=None,                          # WebSocket for streaming
        agent=None,                              # Pre-defined agent type
        role=None,                               # Pre-defined agent role
        parent_query: str = "",                  # Parent query for subtopics
        subtopics: list | None = None,           # Subtopics to research
        visited_urls: set | None = None,         # Already visited URLs
        verbose: bool = True,                    # Verbose logging
        context=None,                            # Pre-loaded context
        headers: dict | None = None,             # HTTP headers
        max_subtopics: int = 5,                  # Max subtopics for detailed
        log_handler=None,                        # Custom log handler
        prompt_family: str | None = None,        # Custom prompt family
        mcp_configs: list[dict] | None = None,   # MCP server configurations
        mcp_max_iterations: int | None = None,   # Deprecated, use mcp_strategy
        mcp_strategy: str | None = None,         # fast, deep, disabled
        **kwargs
    ):
```

### Key Methods

```python
async def conduct_research(self, on_progress=None) -> str:
    """
    Main research orchestration.
    
    1. Selects agent role via LLM (choose_agent)
    2. Delegates to ResearchConductor
    3. Optionally generates images if enabled
    
    Returns: Accumulated research context as string
    """

async def write_report(
    self, 
    existing_headers: list = [],           # Headers to avoid duplication
    relevant_written_contents: list = [],  # Previous content for context
    ext_context=None,                      # External context override
    custom_prompt=""                       # Custom prompt override
) -> str:
    """
    Generate final report from context.
    
    Returns: Markdown report string
    """

def get_costs(self) -> float:
    """Returns total accumulated API costs."""

def add_costs(self, cost: float) -> None:
    """Add to running cost total (used as callback)."""
```
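
The cost callback pattern behind `add_costs` can be illustrated without the library. The sketch below is an assumption-laden stand-in: `CostTracker` and `fake_llm_call` are hypothetical names, and the per-call cost is made up. It only shows how a bound method passed as `cost_callback` accumulates a running total.

```python
class CostTracker:
    """Hypothetical stand-in that mirrors the add_costs/get_costs pair."""

    def __init__(self) -> None:
        self.total = 0.0

    def add_costs(self, cost: float) -> None:
        # Reject non-numeric values early so the total stays meaningful.
        if not isinstance(cost, (int, float)):
            raise TypeError("cost must be a number")
        self.total += cost

    def get_costs(self) -> float:
        return self.total


def fake_llm_call(prompt: str, cost_callback=None) -> str:
    """Pretend LLM call that reports a made-up cost through the callback."""
    if cost_callback is not None:
        cost_callback(0.002)
    return f"response to: {prompt}"


tracker = CostTracker()
for prompt in ("plan", "write"):
    # The bound method is handed over as the callback, as add_costs would be.
    fake_llm_call(prompt, cost_callback=tracker.add_costs)
print(f"{tracker.get_costs():.3f}")
```

Passing the bound method keeps the accounting in one object while every helper that calls an LLM stays unaware of who is counting.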

---

## ResearchConductor

**File:** `gpt_researcher/skills/researcher.py`

Manages the research process:

```python
class ResearchConductor:
    def __init__(self, researcher: GPTResearcher):
        self.researcher = researcher
        self.logger = logging.getLogger(__name__)

    async def plan_research(self, query: str, query_domains=None) -> list:
        """
        Generate sub-queries from main query using LLM.
        
        1. Gets initial search results
        2. Calls plan_research_outline() to generate sub-queries
        
        Returns: List of sub-query strings
        """

    async def conduct_research(self) -> str:
        """
        Main research execution based on report_source.
        
        Handles: web, local, hybrid, azure, langchain_documents, langchain_vectorstore
        
        For each source type:
        1. Load/search data
        2. Process sub-queries
        3. Combine context
        4. Optionally curate sources
        
        Returns: Combined research context string
        """

    async def _process_sub_query(
        self, 
        sub_query: str, 
        scraped_data: list = [], 
        query_domains: list = []
    ) -> str:
        """
        Process a single sub-query.
        
        1. Get MCP context (if configured, based on strategy)
        2. Scrape URLs from search results
        3. Get similar content via embeddings
        4. Combine MCP + web context
        
        Returns: Combined context for this sub-query
        """

    async def _get_context_by_web_search(
        self, 
        query: str, 
        scraped_data: list = [], 
        query_domains: list = []
    ) -> str:
        """Web-based research with sub-query planning."""

    async def _scrape_data_by_urls(
        self, 
        sub_query: str, 
        query_domains: list = []
    ) -> list:
        """Search and scrape URLs for a sub-query."""
```
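
The "process each sub-query, then combine context" step can be sketched with plain `asyncio`. This is a simplified illustration, assuming sub-queries are processed concurrently; `process_sub_query` and `gather_context` are hypothetical stand-ins, not the methods of `ResearchConductor`.

```python
import asyncio


async def process_sub_query(sub_query: str) -> str:
    """Hypothetical stand-in for the search, scrape, and embedding steps."""
    await asyncio.sleep(0)  # yield control, as real network I/O would
    return f"context for: {sub_query}"


async def gather_context(sub_queries: list[str]) -> str:
    # Run all sub-queries concurrently; gather returns results in input order.
    results = await asyncio.gather(*(process_sub_query(q) for q in sub_queries))
    # Drop empty results before joining into one context string.
    return "\n".join(r for r in results if r)


if __name__ == "__main__":
    print(asyncio.run(gather_context(["hardware", "algorithms"])))
```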

---

## ReportGenerator

**File:** `gpt_researcher/skills/writer.py`

```python
class ReportGenerator:
    def __init__(self, researcher: GPTResearcher):
        self.researcher = researcher
        self.research_params = {
            "query": researcher.query,
            "agent_role_prompt": researcher.cfg.agent_role or researcher.role,
            "report_type": researcher.report_type,
            "report_source": researcher.report_source,
            "tone": researcher.tone,
            "websocket": researcher.websocket,
            "cfg": researcher.cfg,
            "headers": researcher.headers,
        }

    async def write_report(
        self,
        existing_headers: list = [],
        relevant_written_contents: list = [],
        ext_context=None,
        custom_prompt="",
        available_images: list = [],  # Pre-generated images to embed
    ) -> str:
        """
        Generate report using LLM.
        
        Calls generate_report() action with context and images.
        
        Returns: Markdown report
        """

    async def write_introduction(self, ...) -> str:
        """Write report introduction section."""

    async def write_conclusion(self, ...) -> str:
        """Write report conclusion with references."""
```
📎 config-reference.md
# Configuration Reference

## Table of Contents
- [Required Variables](#required-variables)
- [LLM Configuration](#llm-configuration)
- [Provider API Keys](#provider-api-keys)
- [Retriever Configuration](#retriever-configuration)
- [Report Configuration](#report-configuration)
- [Feature Toggles](#feature-toggles)
- [Configuration Priority](#configuration-priority)
- [Example .env](#example-env)

---

## Required Variables

```bash
OPENAI_API_KEY=sk-...          # Or another LLM provider key
TAVILY_API_KEY=tvly-...        # Or another retriever key
```

---

## LLM Configuration

```bash
LLM_PROVIDER=openai            # openai, anthropic, google, groq, together, etc.
FAST_LLM=gpt-4o-mini           # Quick tasks (summarization)
SMART_LLM=gpt-4o               # Complex reasoning (report writing)
STRATEGIC_LLM=o3-mini          # Planning (agent selection)
TEMPERATURE=0.4                # 0.0-1.0
MAX_TOKENS=4000
REASONING_EFFORT=medium        # For o-series: low, medium, high
```

---

## Provider API Keys

```bash
# OpenAI
OPENAI_API_KEY=sk-...
OPENAI_BASE_URL=https://api.openai.com/v1

# Anthropic
ANTHROPIC_API_KEY=sk-ant-...

# Google
GOOGLE_API_KEY=AIza...

# Groq
GROQ_API_KEY=gsk_...

# Azure OpenAI
AZURE_OPENAI_API_KEY=...
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
```

---

## Retriever Configuration

```bash
RETRIEVER=tavily               # Single or comma-separated: tavily,google,mcp
MAX_SEARCH_RESULTS_PER_QUERY=5
MAX_URLS_TO_SCRAPE=10
SIMILARITY_THRESHOLD=0.42
```
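
As a rough illustration of the comma-separated `RETRIEVER` format, the value can be split into a clean list of names. `parse_retrievers` is a hypothetical helper written for this example; the library's own parsing may differ in details such as validation or defaults.

```python
def parse_retrievers(value: str, default: str = "tavily") -> list[str]:
    """Split a comma-separated retriever string into normalized names."""
    names = [name.strip().lower() for name in value.split(",")]
    names = [name for name in names if name]  # ignore empty segments
    # dict.fromkeys removes duplicates while preserving the original order.
    return list(dict.fromkeys(names)) or [default]


print(parse_retrievers("tavily, Google,mcp"))
```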

### Retriever API Keys

```bash
TAVILY_API_KEY=tvly-...
GOOGLE_API_KEY=AIza...
GOOGLE_CX_KEY=...
BING_API_KEY=...
SERPER_API_KEY=...
SERPAPI_API_KEY=...
EXA_API_KEY=...
```

---

## Report Configuration

```bash
REPORT_FORMAT=apa              # apa, mla, chicago, harvard, ieee
TOTAL_WORDS=1000
LANGUAGE=english
CURATE_SOURCES=true
```

---

## Feature Toggles

### Image Generation

```bash
IMAGE_GENERATION_ENABLED=true
GOOGLE_API_KEY=AIza...
IMAGE_GENERATION_MODEL=models/gemini-2.5-flash-image
IMAGE_GENERATION_MAX_IMAGES=3
IMAGE_GENERATION_STYLE=dark    # dark, light, auto
```

### Deep Research

```bash
DEEP_RESEARCH_BREADTH=4        # Subtopics per level
DEEP_RESEARCH_DEPTH=2          # Recursion levels
DEEP_RESEARCH_CONCURRENCY=2    # Parallel tasks
```

### MCP

```bash
MCP_STRATEGY=fast              # fast, deep, disabled
```

### Local Documents

```bash
DOC_PATH=./my-docs
# Supports: PDF, DOCX, TXT, CSV, XLSX, PPTX, MD
```
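
To check which files under `DOC_PATH` fall into the supported formats listed above, a small filter such as the following can help. It is only a sketch: `find_documents` and `SUPPORTED_SUFFIXES` are hypothetical names, and the library's own document loader is not shown here.

```python
from pathlib import Path

# Suffixes taken from the supported-format list above.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".txt", ".csv", ".xlsx", ".pptx", ".md"}


def find_documents(doc_path: str) -> list[Path]:
    """Return supported files under doc_path, searched recursively and sorted."""
    root = Path(doc_path)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

Calling `find_documents("./my-docs")` before a run with `report_source="local"` is a quick way to confirm the folder contains something usable.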

### Server

```bash
HOST=0.0.0.0
PORT=8000
VERBOSE=true
```

---

## Configuration Priority

```
Environment Variables (highest)
        ↓
JSON Config File (if provided)
        ↓
Default Values (lowest)
```

**Important:** Config keys are lowercased when accessed:

```python
# In default.py: "SMART_LLM": "gpt-4o"
# Access as: self.cfg.smart_llm  # lowercase!
```
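
The priority order and the lowercase access can be sketched together as a small merge function. This is a minimal illustration, not the actual `Config` implementation: `load_settings` and the sample `DEFAULTS` are hypothetical, and type conversion of environment strings is omitted.

```python
import json
import os
from types import SimpleNamespace

# Hypothetical defaults for illustration; the real ones live in variables/default.py.
DEFAULTS = {"SMART_LLM": "gpt-4o", "TOTAL_WORDS": 1000}


def load_settings(json_path=None, env=None):
    """Merge defaults < JSON file < environment, then expose lowercase attributes."""
    env = os.environ if env is None else env
    merged = dict(DEFAULTS)
    if json_path:
        with open(json_path, encoding="utf-8") as f:
            merged.update(json.load(f))  # JSON overrides defaults
    for key in merged:
        if key in env:
            merged[key] = env[key]  # environment wins; values stay strings here
    return SimpleNamespace(**{key.lower(): value for key, value in merged.items()})


cfg = load_settings(env={"SMART_LLM": "gpt-4o-mini"})
print(cfg.smart_llm, cfg.total_words)
```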

---

## Example .env

```bash
# Required
OPENAI_API_KEY=sk-your-key
TAVILY_API_KEY=tvly-your-key

# LLM
FAST_LLM=gpt-4o-mini
SMART_LLM=gpt-4o

# Report
TOTAL_WORDS=1000
LANGUAGE=english

# Optional: Images
IMAGE_GENERATION_ENABLED=true
GOOGLE_API_KEY=AIza-your-key
IMAGE_GENERATION_STYLE=dark
```
📎 deep-research.md
# Deep Research Mode Reference

## Table of Contents
- [Overview](#overview)
- [Configuration](#configuration)
- [DeepResearchSkill](#deepresearchskill)
- [Usage](#usage)

---

## Overview

Deep Research uses recursive tree-like exploration with configurable depth and breadth.

---

## Configuration

```bash
DEEP_RESEARCH_BREADTH=4    # Subtopics per level
DEEP_RESEARCH_DEPTH=2      # Recursion levels
DEEP_RESEARCH_CONCURRENCY=2  # Parallel tasks
```

---

## DeepResearchSkill

**File:** `gpt_researcher/skills/deep_research.py`

```python
class DeepResearchSkill:
    def __init__(self, researcher):
        self.researcher = researcher
        self.breadth = getattr(researcher.cfg, 'deep_research_breadth', 4)
        self.depth = getattr(researcher.cfg, 'deep_research_depth', 2)
        self.concurrency_limit = getattr(researcher.cfg, 'deep_research_concurrency', 2)
        self.learnings = []
        self.research_sources = []
        self.context = []

    async def deep_research(self, query: str, on_progress=None) -> str:
        """
        Recursive research with depth and breadth.
        
        1. Research main topic
        2. Generate subtopics (breadth)
        3. For each subtopic, recursively research (depth)
        4. Aggregate all findings
        5. Generate comprehensive report
        """
```

---

## Usage

```python
import asyncio
from gpt_researcher import GPTResearcher

async def main():
    researcher = GPTResearcher(
        query="Comprehensive analysis of quantum computing",
        report_type="deep",  # Triggers deep research
    )
    await researcher.conduct_research()
    return await researcher.write_report()

report = asyncio.run(main())
```

### Research Tree Structure

```
Query: "Quantum Computing"
├── Subtopic 1: Hardware (depth 1)
│   ├── Subtopic 1.1: Superconducting qubits (depth 2)
│   └── Subtopic 1.2: Ion traps (depth 2)
├── Subtopic 2: Algorithms (depth 1)
│   ├── Subtopic 2.1: Shor's algorithm (depth 2)
│   └── Subtopic 2.2: Grover's algorithm (depth 2)
├── Subtopic 3: Applications (depth 1)
│   └── ...
└── Subtopic 4: Challenges (depth 1)
    └── ...
```

With `DEEP_RESEARCH_BREADTH=4` and `DEEP_RESEARCH_DEPTH=2`, the research starts with 4 subtopics at the first level and recurses 2 levels deep, as in the tree above.
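
The recursion and the concurrency limit can be sketched with `asyncio.Semaphore`. This is an illustrative model, not the `DeepResearchSkill` code: `explore` and `run` are hypothetical, subtopics are generated as placeholder strings, and the breadth is kept constant at every level for simplicity, which the library may not do.

```python
import asyncio


async def explore(query, breadth, depth, semaphore, visited):
    """Recursively visit a query tree; `visited` collects every query researched."""
    async with semaphore:
        visited.append(query)  # stand-in for real research on this query
        await asyncio.sleep(0)
    if depth == 0:
        return
    # Placeholder subtopics; a real run would ask an LLM for them.
    subtopics = [f"{query}.{i}" for i in range(1, breadth + 1)]
    await asyncio.gather(
        *(explore(s, breadth, depth - 1, semaphore, visited) for s in subtopics)
    )


async def run(breadth=2, depth=2, concurrency=2):
    visited = []
    await explore("root", breadth, depth, asyncio.Semaphore(concurrency), visited)
    return visited


if __name__ == "__main__":
    print(len(asyncio.run(run(breadth=4, depth=2))))
```

The semaphore is released before recursing, so a small concurrency limit throttles the work without deadlocking on nested calls.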
📎 flows.md
# Research Flow & Data Flow

## Table of Contents
- [End-to-End Research Flow](#end-to-end-research-flow)
- [Data Flow Between Components](#data-flow-between-components)

---

## End-to-End Research Flow

### 1. Request Entry

**File:** `backend/server/app.py`

```python
# REST API endpoint
@app.post("/report/")
async def generate_report(research_request: ResearchRequest, background_tasks: BackgroundTasks):
    research_id = sanitize_filename(f"task_{int(time.time())}_{research_request.task}")
    # Calls write_report() which uses run_agent()

# WebSocket endpoint
@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await manager.connect(websocket)
    await handle_websocket_communication(websocket, manager)
```

### 2. Agent Runner

**File:** `backend/server/websocket_manager.py`

```python
async def run_agent(task, report_type, report_source, source_urls, ...):
    """Main entry point for research execution."""
    # Create logs handler
    logs_handler = CustomLogsHandler(websocket, task)
    
    # Configure MCP if enabled
    if mcp_enabled and mcp_configs:
        os.environ["RETRIEVER"] = f"{current_retriever},mcp"
        os.environ["MCP_STRATEGY"] = mcp_strategy
    
    # Route based on report type
    if report_type == "multi_agents":
        report = await run_research_task(query=task, websocket=logs_handler, ...)
    elif report_type == ReportType.DetailedReport.value:
        researcher = DetailedReport(query=task, ...)
        report = await researcher.run()
    else:
        researcher = BasicReport(query=task, ...)
        report = await researcher.run()
    
    return report
```

### 3. Research Phase

**File:** `gpt_researcher/agent.py`

```python
async def conduct_research(self, on_progress=None):
    # Handle deep research separately
    if self.report_type == ReportType.DeepResearch.value and self.deep_researcher:
        return await self._handle_deep_research(on_progress)
    
    # Choose agent role via LLM
    if not (self.agent and self.role):
        self.agent, self.role = await choose_agent(
            query=self.query,
            cfg=self.cfg,
            parent_query=self.parent_query,
            cost_callback=self.add_costs,
            headers=self.headers,
            prompt_family=self.prompt_family,
        )
    
    # Conduct research
    self.context = await self.research_conductor.conduct_research()
    
    # Generate images if enabled (pre-generation for seamless UX)
    if self.cfg.image_generation_enabled and self.image_generator:
        self.available_images = await self.image_generator.plan_and_generate_images(
            research_context=self.context,
            research_query=self.query,
            research_id=self.research_id,
            websocket=self.websocket,
        )
    
    return self.context
```

### 4. Sub-Query Processing

**File:** `gpt_researcher/skills/researcher.py`

```python
async def _process_sub_query(self, sub_query: str, scraped_data: list = [], query_domains: list = []):
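    # Defaults so both names exist even when a branch below is skipped
    mcp_context = []
    web_context = ""

    # MCP strategy handling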
    mcp_retrievers = [r for r in self.researcher.retrievers if "mcpretriever" in r.__name__.lower()]
    mcp_strategy = self._get_mcp_strategy()
    
    if mcp_retrievers:
        if mcp_strategy == "fast" and self._mcp_results_cache is not None:
            # Reuse cached MCP results
            mcp_context = self._mcp_results_cache.copy()
        elif mcp_strategy == "deep":
            # Run MCP for every sub-query
            mcp_context = await self._execute_mcp_research_for_queries([sub_query], mcp_retrievers)
    
    # Get web search context
    if not scraped_data:
        scraped_data = await self._scrape_data_by_urls(sub_query, query_domains)
    
    # Get similar content via embeddings
    if scraped_data:
        web_context = await self.researcher.context_manager.get_similar_content_by_query(
            sub_query, scraped_data
        )
    
    # Combine MCP + web context
    combined_context = self._combine_mcp_and_web_context(mcp_context, web_context, sub_query)
    return combined_context
```

### 5. Report Generation

**File:** `gpt_researcher/actions/report_generation.py`

```python
async def generate_report(
    query: str,
    context: str,
    agent_role_prompt: str,
    report_type: str,
    report_source: str,  # Passed through to the prompt generator below
    websocket=None,
    cfg=None,
    tone=None,
    headers=None,
    cost_callback=None,
    prompt_family=None,
    available_images: list = [],
    **kwargs
) -> str:
    """Generate report using LLM."""
    # Get prompt generator
    generate_prompt = prompt_family.get_prompt_by_report_type(report_type)
    
    # Build prompt with context and available images
    content = generate_prompt(
        query, context, report_source,
        report_format=cfg.report_format,
        tone=tone,
        total_words=cfg.total_words,
        language=cfg.language,
        available_images=available_images,
    )
    
    # Call LLM
    report = await create_chat_completion(
        model=cfg.smart_llm,
        messages=[{"role": "user", "content": content}],
        temperature=cfg.temperature,
        llm_provider=cfg.smart_llm_provider,
        max_tokens=cfg.smart_token_limit,
        llm_kwargs=cfg.llm_kwargs,
        cost_callback=cost_callback,
    )
    
    return report
```

---

## Data Flow Between Components

```
User Query
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│ GPTResearcher.__init__()                                         │
│   • Loads Config (env → json → defaults)                        │
│   • Initializes skills: ResearchConductor, ReportGenerator, etc │
│   • Initializes retrievers based on RETRIEVER env var           │
│   • Initializes ImageGenerator if IMAGE_GENERATION_ENABLED      │
└─────────────────────────────────────────────────────────────────┘
    │
    │  researcher.conduct_research()
    ▼
┌─────────────────────────────────────────────────────────────────┐
│ choose_agent()                                                   │
│   Input: query, config                                          │
│   Output: (agent_type: str, role_prompt: str)                   │
│   • LLM selects best agent role for the query                   │
└─────────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│ ResearchConductor.conduct_research()                             │
│   Input: self.researcher (has query, config, retrievers)        │
│   Output: context: str                                          │
│                                                                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ plan_research()                                          │   │
│   │   Input: query                                           │   │
│   │   Output: sub_queries: list[str]                         │   │
│   │   • Calls LLM to generate 3-5 sub-queries                │   │
│   └─────────────────────────────────────────────────────────┘   │
│                          │                                       │
│                          ▼                                       │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ For each sub_query:                                      │   │
│   │   _process_sub_query()                                   │   │
│   │     Input: sub_query                                     │   │
│   │     Output: sub_context: str                             │   │
│   │                                                          │   │
│   │     1. MCP retrieval (if configured)                     │   │
│   │        → mcp_context: list[dict]                         │   │
│   │                                                          │   │
│   │     2. Web search via retrievers                         │   │
│   │        → search_results: list[dict]                      │   │
│   │                                                          │   │
│   │     3. Scrape URLs                                       │   │
│   │        → scraped_content: list[dict]                     │   │
│   │                                                          │   │
│   │     4. Similarity search via embeddings                  │   │
│   │        → relevant_context: str                           │   │
│   │                                                          │   │
│   │     5. Combine MCP + web context                         │   │
│   │        → combined_context: str                           │   │
│   └─────────────────────────────────────────────────────────┘   │
│                          │                                       │
│                          ▼                                       │
│   Aggregate all sub_contexts → final context: str               │
└─────────────────────────────────────────────────────────────────┘
    │
    │  If IMAGE_GENERATION_ENABLED:
    ▼
┌─────────────────────────────────────────────────────────────────┐
│ ImageGenerator.plan_and_generate_images()                        │
│   Input: context, query, research_id                            │
│   Output: available_images: list[dict]                          │
│     [{"url": "/outputs/images/.../img.png",                     │
│       "title": "...", "description": "..."}]                    │
│                                                                  │
│   1. LLM analyzes context for visual concepts                   │
│   2. Generates 2-3 images in parallel via Gemini                │
│   3. Saves to outputs/images/{research_id}/                     │
└─────────────────────────────────────────────────────────────────┘
    │
    │  researcher.write_report()
    ▼
┌─────────────────────────────────────────────────────────────────┐
│ ReportGenerator.write_report()                                   │
│   Input: context, available_images                              │
│   Output: report: str (markdown)                                │
│                                                                  │
│   → generate_report() action                                    │
│       • Builds prompt with context + image list                 │
│       • LLM generates report with embedded images               │
└─────────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│ Output                                                           │
│   • Streamed via WebSocket (type: "report")                     │
│   • Final via WebSocket (type: "report_complete")               │
│   • Exported to PDF, DOCX, Markdown                             │
│   • Saved to outputs/ directory                                 │
└─────────────────────────────────────────────────────────────────┘
```
📎 mcp.md
# MCP Integration Reference

## Table of Contents
- [Overview](#overview)
- [Configuration](#configuration)
- [Strategy Options](#strategy-options)
- [Processing Logic](#processing-logic)

---

## Overview

MCP (Model Context Protocol) enables research from specialized data sources (GitHub, databases, APIs) alongside web search.

---

## Configuration

```python
researcher = GPTResearcher(
    query="...",
    mcp_configs=[
        {
            "name": "github",                    # Server name
            "command": "npx",                    # Command to start
            "args": ["-y", "@modelcontextprotocol/server-github"],
            "env": {"GITHUB_TOKEN": "..."},      # Environment vars
        },
        {
            "name": "filesystem",
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-filesystem", "/docs"],
        },
        {
            "name": "remote",
            "connection_url": "ws://server:8080",  # WebSocket connection
            "connection_type": "websocket",
            "connection_token": "auth_token",
        }
    ],
    mcp_strategy="fast",  # fast, deep, disabled
)
```

---

## Strategy Options

| Strategy | Behavior | Use Case |
|----------|----------|----------|
| `fast` (default) | Run MCP once with original query, cache results | Performance-focused |
| `deep` | Run MCP for every sub-query | Maximum thoroughness |
| `disabled` | Skip MCP entirely | Web-only research |

---

## Processing Logic

**File:** `gpt_researcher/skills/researcher.py`

```python
# At start of research (for 'fast' strategy)
if mcp_strategy == "fast":
    mcp_context = await self._execute_mcp_research_for_queries([query], mcp_retrievers)
    self._mcp_results_cache = mcp_context  # Cache for reuse

# During sub-query processing
if mcp_strategy == "fast" and self._mcp_results_cache is not None:
    mcp_context = self._mcp_results_cache.copy()  # Reuse cache
elif mcp_strategy == "deep":
    mcp_context = await self._execute_mcp_research_for_queries([sub_query], mcp_retrievers)
```
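
The difference between the strategies is easiest to see by counting MCP calls. The sketch below models the caching logic in isolation; `McpContextProvider`, `count_mcp_calls`, and the fake runner are hypothetical names for this example and do not exist in the library.

```python
import asyncio


class McpContextProvider:
    """Hypothetical helper that mimics the fast/deep/disabled strategies."""

    def __init__(self, strategy, run_mcp):
        self.strategy = strategy
        self.run_mcp = run_mcp  # async callable: query -> list of result dicts
        self._cache = None

    async def prime(self, query):
        # 'fast' runs MCP once up front with the original query.
        if self.strategy == "fast":
            self._cache = await self.run_mcp(query)

    async def for_sub_query(self, sub_query):
        if self.strategy == "fast" and self._cache is not None:
            return list(self._cache)  # copy so callers cannot mutate the cache
        if self.strategy == "deep":
            return await self.run_mcp(sub_query)
        return []  # 'disabled', or 'fast' without a primed cache


async def count_mcp_calls(strategy, sub_queries):
    calls = []

    async def fake_run_mcp(query):
        calls.append(query)
        return [{"content": f"mcp result for {query}"}]

    provider = McpContextProvider(strategy, fake_run_mcp)
    await provider.prime("main query")
    for sub_query in sub_queries:
        await provider.for_sub_query(sub_query)
    return len(calls)


if __name__ == "__main__":
    for name in ("fast", "deep", "disabled"):
        print(name, asyncio.run(count_mcp_calls(name, ["a", "b", "c"])))
```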

### WebSocket Request Example

```json
{
    "task": "Research query",
    "report_type": "research_report",
    "mcp_enabled": true,
    "mcp_strategy": "fast",
    "mcp_configs": [
        {
            "name": "github",
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-github"],
            "env": {"GITHUB_TOKEN": "..."}
        }
    ]
}
```
📎 multi-agents.md
# Multi-Agent System Reference

## Table of Contents
- [Overview](#overview)
- [Agent Roles](#agent-roles)
- [Workflow](#workflow)
- [Usage](#usage)

---

## Overview

**Directory:** `multi_agents/`

A LangGraph-based system inspired by the [STORM paper](https://arxiv.org/abs/2402.14207). Multiple agents collaborate to generate 5-6 page reports.

---

## Agent Roles

| Agent | File | Role |
|-------|------|------|
| Human | - | Oversees and provides feedback |
| Chief Editor | `agents/orchestrator.py` | Master coordinator via LangGraph |
| Researcher | Uses GPTResearcher | Deep research on topics |
| Editor | `agents/editor.py` | Plans outline and structure |
| Reviewer | `agents/reviewer.py` | Validates research correctness |
| Revisor | `agents/revisor.py` | Revises based on feedback |
| Writer | `agents/writer.py` | Compiles final report |
| Publisher | `agents/publisher.py` | Exports to PDF, DOCX, Markdown |

---

## Workflow

```
1. Browser (GPTResearcher) → Initial research
2. Editor → Plans report outline
3. For each outline topic (parallel):
   a. Researcher → In-depth subtopic research
   b. Reviewer → Validates draft
   c. Revisor → Revises until satisfactory
4. Writer → Compiles final report
5. Publisher → Exports to multiple formats
```
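
Step 3 (review, then revise until satisfactory) is a bounded feedback loop. The following sketch shows the control flow only, under heavy assumptions: `review`, `revise`, and `review_loop` are hypothetical functions with toy logic, and the real agents call LLMs inside a LangGraph workflow.

```python
from typing import Optional


def review(draft: str) -> Optional[str]:
    """Hypothetical reviewer: return feedback, or None when the draft is accepted."""
    return None if "[cited]" in draft else "add citations"


def revise(draft: str, feedback: str) -> str:
    """Hypothetical reviser: apply the feedback (here, append a marker)."""
    return f"{draft} [cited]"


def review_loop(draft: str, max_rounds: int = 3) -> str:
    # Cap the number of rounds so a never-satisfied reviewer cannot loop forever.
    for _ in range(max_rounds):
        feedback = review(draft)
        if feedback is None:
            break
        draft = revise(draft, feedback)
    return draft


print(review_loop("Draft on market trends"))
```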

---

## Usage

### Via API

```python
report_type = "multi_agents"
```

### Via WebSocket

```json
{
    "task": "Research query",
    "report_type": "multi_agents",
    "tone": "Analytical"
}
```

### Directly in Python

```python
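from gpt_researcher.utils.enum import Tone
from multi_agents import run_research_task

# Run inside an async function; `handler` is your WebSocket or logs handler.
report = await run_research_task(
    query="Comprehensive analysis of market trends",
    websocket=handler,
    tone=Tone.Analytical,
)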
```

### Configuration File

**File:** `multi_agents/task.json`

Configure the multi-agent research task parameters and agent behaviors.
📎 prompts.md
# Prompt System Reference

## Table of Contents
- [PromptFamily Class](#promptfamily-class)
- [Key Prompt Examples](#key-prompt-examples)

---

## PromptFamily Class

**File:** `gpt_researcher/prompts.py`

All prompts are centralized in the `PromptFamily` class. This allows for model-specific prompt variations.

```python
class PromptFamily:
    """
    General purpose class for prompt formatting.
    Can be overwritten with model-specific derived classes.
    """

    def __init__(self, config: Config):
        self.cfg = config

    @staticmethod
    def get_prompt_by_report_type(report_type: str):
        """Returns the appropriate prompt generator for the report type."""
        match report_type:
            case ReportType.ResearchReport.value:
                return PromptFamily.generate_report_prompt
            case ReportType.DetailedReport.value:
                return PromptFamily.generate_report_prompt
            case ReportType.OutlineReport.value:
                return PromptFamily.generate_outline_report_prompt
            # ... etc
```
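
The idea of a derived, model-specific prompt class can be sketched with a toy hierarchy. This is not the real `PromptFamily`: `BasePrompts`, `ConcisePrompts`, and `get_prompt` are hypothetical, and the lookup goes through `cls` (a design choice of this sketch, unlike the excerpt above) so that subclass overrides are picked up.

```python
class BasePrompts:
    """Hypothetical stand-in for a prompt family."""

    @staticmethod
    def report_prompt(question: str, context: str) -> str:
        return f'Information: "{context}"\nAnswer: "{question}"'

    @classmethod
    def get_prompt(cls, report_type: str):
        # Resolve through cls so a subclass override wins over the base method.
        table = {"research_report": cls.report_prompt}
        return table[report_type]


class ConcisePrompts(BasePrompts):
    """Variant for a model that does better with a tighter instruction."""

    @staticmethod
    def report_prompt(question: str, context: str) -> str:
        base = BasePrompts.report_prompt(question, context)
        return base + "\nKeep it under 200 words."


prompt_fn = ConcisePrompts.get_prompt("research_report")
print(prompt_fn("What is MCP?", "some context"))
```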

---

## Key Prompt Examples

### Agent Selection Prompt

```python
@staticmethod
def generate_agent_role_prompt(query: str, parent_query: str = "") -> str:
    return f"""Analyze the research query and select the most appropriate agent role.

Query: "{query}"
{f'Parent Query: "{parent_query}"' if parent_query else ''}

Based on the query, determine:
1. The domain expertise needed
2. The research approach required
3. The appropriate agent persona

Return a JSON object with:
- "agent": The agent type (e.g., "Research Analyst", "Technical Writer")
- "role": A detailed role description for how the agent should approach this research
"""
```

### Research Planning Prompt

```python
@staticmethod
def generate_search_queries_prompt(
    query: str,
    parent_query: str = "",
    report_type: str = "",
    max_iterations: int = 3,
    context: str = "",
) -> str:
    return f"""Generate {max_iterations} focused search queries to research: "{query}"

Context from initial search:
{context}

Requirements:
- Each query should explore a different aspect
- Queries should be specific and searchable
- Consider the report type: {report_type}

Return a JSON array of query strings.
"""
```

### Report Generation Prompt (with images)

```python
@staticmethod
def generate_report_prompt(
    question: str,
    context: str,
    report_source: str,
    report_format="apa",
    total_words=1000,
    tone=None,
    language="english",
    available_images: list | None = None,  # None avoids a shared mutable default
) -> str:
    # Build image embedding instruction if images available
    image_instruction = ""
    if available_images:
        image_list = "\n".join([
            f"- Title: {img.get('title')}\n  URL: {img['url']}"
            for img in available_images
        ])
        image_instruction = f"""
AVAILABLE IMAGES (embed where relevant):
{image_list}

Use markdown format: ![Title](URL)
"""

    return f"""Information: "{context}"
---
Using the above information, answer: "{question}" in a detailed report.

- Format: {report_format}
- Length: ~{total_words} words
- Tone: {tone.value if tone else "Objective"}
- Language: {language}
- Include citations for all factual claims
{image_instruction}
"""
```

### MCP Tool Selection Prompt

```python
import json  # module-level import required by json.dumps below

@staticmethod
def generate_mcp_tool_selection_prompt(query: str, tools_info: list, max_tools: int = 3) -> str:
    return f"""Select the most relevant tools for researching: "{query}"

AVAILABLE TOOLS:
{json.dumps(tools_info, indent=2)}

Select exactly {max_tools} tools ranked by relevance.

Return JSON:
{{
  "selected_tools": [
    {{"index": 0, "name": "tool_name", "relevance_score": 9, "reason": "..."}}
  ]
}}
"""
```
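
The reply format above refers to tools by `index`, so the selection has to be mapped back onto `tools_info` and checked before use. The following is a sketch under that assumption; `pick_tools` is a hypothetical helper and does not mirror the library's own handling.

```python
import json


def pick_tools(raw: str, tools_info: list, max_tools: int = 3) -> list[dict]:
    """Map a tool-selection reply back to entries of tools_info.

    Skips entries whose index is missing, not an int, out of range, or
    repeated, then returns at most max_tools tools, highest score first.
    """
    try:
        selected = json.loads(raw).get("selected_tools", [])
    except (json.JSONDecodeError, AttributeError):
        # AttributeError covers valid JSON that is not an object.
        return []
    if not isinstance(selected, list):
        return []
    valid = []
    seen = set()
    for entry in selected:
        if not isinstance(entry, dict):
            continue
        index = entry.get("index")
        # bool is a subclass of int, so exclude it explicitly.
        if not isinstance(index, int) or isinstance(index, bool):
            continue
        if not 0 <= index < len(tools_info) or index in seen:
            continue
        seen.add(index)
        score = entry.get("relevance_score", 0)
        valid.append((score if isinstance(score, (int, float)) else 0, index))
    # The sort is stable, so equal scores keep the model's order.
    valid.sort(key=lambda pair: pair[0], reverse=True)
    return [tools_info[index] for _, index in valid[:max_tools]]
```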
📎 retrievers.md
# Retriever System Reference

## Table of Contents
- [Available Retrievers](#available-retrievers)
- [Retriever Selection](#retriever-selection)
- [Adding a New Retriever](#adding-a-new-retriever)

---

## Available Retrievers

**Directory:** `gpt_researcher/retrievers/`

| Retriever | Class | API Key Env Var |
|-----------|-------|-----------------|
| Tavily | `TavilySearch` | `TAVILY_API_KEY` |
| Google | `GoogleSearch` | `GOOGLE_API_KEY`, `GOOGLE_CX_KEY` |
| DuckDuckGo | `Duckduckgo` | None |
| Bing | `BingSearch` | `BING_API_KEY` |
| Serper | `SerperSearch` | `SERPER_API_KEY` |
| SerpAPI | `SerpApiSearch` | `SERPAPI_API_KEY` |
| SearchAPI | `SearchApiSearch` | `SEARCHAPI_API_KEY` |
| Exa | `ExaSearch` | `EXA_API_KEY` |
| arXiv | `ArxivSearch` | None |
| Semantic Scholar | `SemanticScholarSearch` | None |
| PubMed Central | `PubMedCentralSearch` | None |
| MCP | `MCPRetriever` | Per-server |
| Custom | `CustomRetriever` | User-defined |

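The table above can be turned into a quick pre-flight check for missing API keys. This is a sketch only: `REQUIRED_ENV` and `missing_env` are hypothetical, and the lowercase names `bing`, `serper`, and `exa` are assumed by analogy with `tavily` and `google`, which are the names this document shows.

```python
import os

# Required variables per retriever name, copied from the table above.
# Only a subset is listed; the lowercase keys other than "tavily" and
# "google" are assumptions.
REQUIRED_ENV = {
    "tavily": ["TAVILY_API_KEY"],
    "google": ["GOOGLE_API_KEY", "GOOGLE_CX_KEY"],
    "bing": ["BING_API_KEY"],
    "serper": ["SERPER_API_KEY"],
    "exa": ["EXA_API_KEY"],
}


def missing_env(retriever_names: str, env=None) -> dict[str, list[str]]:
    """Map each listed retriever to its required variables that are unset."""
    env = os.environ if env is None else env
    missing = {}
    for name in retriever_names.split(","):
        name = name.strip().lower()
        # Names absent from REQUIRED_ENV are treated as needing no key.
        unset = [var for var in REQUIRED_ENV.get(name, []) if not env.get(var)]
        if unset:
            missing[name] = unset
    return missing
```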
---

## Retriever Selection

**File:** `gpt_researcher/actions/retriever.py`

```python
def get_retriever(retriever: str):
    """Get a retriever class by name."""
    match retriever:
        case "tavily":
            from gpt_researcher.retrievers import TavilySearch
            return TavilySearch
        case "google":
            from gpt_researcher.retrievers import GoogleSearch
            return GoogleSearch
        case "mcp":
            from gpt_researcher.retrievers import MCPRetriever
            return MCPRetriever
        # ... etc

def get_retrievers(retriever_names: str, headers: dict | None = None) -> list:
    """
    Get multiple retriever classes from a comma-separated string.

    Usage: RETRIEVER=tavily,google,mcp
    """
    retrievers = []
    for name in retriever_names.split(","):
        retriever_class = get_retriever(name.strip())
        # Skip names that get_retriever does not resolve to a class
        if retriever_class:
            retrievers.append(retriever_class)
    return retrievers
```
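
The loop above silently skips names it cannot resolve. If you would rather surface typos in `RETRIEVER`, a stricter variant could report them. The sketch below is self-contained: `REGISTRY`, `resolve_retrievers`, and the two stub classes are illustrative stand-ins, not the library's code.

```python
class TavilySearch:
    """Stand-in for the real retriever class (illustration only)."""


class GoogleSearch:
    """Stand-in for the real retriever class (illustration only)."""


# Hypothetical name-to-class registry replacing the match statement.
REGISTRY = {"tavily": TavilySearch, "google": GoogleSearch}


def resolve_retrievers(retriever_names: str) -> tuple[list[type], list[str]]:
    """Return (retriever classes, unrecognized names) for a comma-separated string."""
    classes: list[type] = []
    unknown: list[str] = []
    for raw_name in retriever_names.split(","):
        name = raw_name.strip().lower()
        if not name:
            continue  # tolerate empty entries such as a trailing comma
        cls = REGISTRY.get(name)
        if cls is None:
            unknown.append(name)
        elif cls not in classes:
            classes.append(cls)  # keep first occurrence, drop duplicates
    return classes, unknown
```

A caller could log or raise on a non-empty `unknown` list instead of continuing with fewer retrievers than intended.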

---

## Adding a New Retriever

### Step 1: Create Retriever File

**File:** `gpt_researcher/retrievers/my_retriever/my_retriever.py`

```python
class MyRetriever:
    def __init__(self, query: str, headers: dict | None = None):
        self.query = query
        self.headers = headers or {}

    async def search(self, max_results: int = 10) -> list[dict]:
        """
        Return up to max_results results, each shaped as:
        {
            "title": str,
            "href": str,
            "body": str
        }
        """
        # Replace with a call to your search backend.
        raise NotImplementedError
```

### Step 2: Register in retriever.py

**File:** `gpt_researcher/actions/retriever.py`

```python
case "my_retriever":
    # The class lives in the module my_retriever.py inside the my_retriever package
    from gpt_researcher.retrievers.my_retriever.my_retriever import MyRetriever
    return MyRetriever
```

### Step 3: Export in __init__.py

**File:** `gpt_researcher/retrievers/__init__.py`

```python
from .my_retriever.my_retriever import MyRetriever
__all__ = [..., "MyRetriever"]  # append "MyRetriever" to the existing list
```

### Step 4: Usage

```bash
RETRIEVER=tavily,my_retriever
```

```python
researcher = GPTResearcher(
    query="...",
    # Will use both Tavily and your custom retriever
)
```