
Configuration

Customize Nancy Brain's behavior through configuration files and environment variables.

Configuration Files

Nancy Brain uses YAML configuration files to define repositories, search weights, and processing options.

Repository Configuration (config/repositories.yml)

Define which GitHub repositories to index:

# Group repositories by category
ml_frameworks:
  - name: "scikit-learn"
    url: "https://github.com/scikit-learn/scikit-learn"
    description: "Machine learning library for Python"
    is_notebook: false

  - name: "pytorch"  
    url: "https://github.com/pytorch/pytorch"
    description: "Deep learning framework"
    is_notebook: false

documentation:
  - name: "python-docs"
    url: "https://github.com/python/cpython"
    description: "Python source and documentation" 
    is_notebook: false

jupyter_examples:
  - name: "data-science-notebooks"
    url: "https://github.com/your-org/notebooks"
    description: "Data science tutorial notebooks"
    is_notebook: true  # Special handling for notebooks

Fields:

- name: Unique identifier within the category
- url: Git repository URL (HTTPS or SSH)
- description: Human-readable description
- is_notebook: Set to true for notebook-heavy repositories
- ref (optional): Pin to a specific branch, tag, or commit SHA (e.g. "v2.1.0", "main", "a3f9d12"). If omitted, the default branch tip is used.

Categories become path prefixes in search results (e.g., ml_frameworks/scikit-learn/...).
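As a rough illustration of how categories and names combine into path prefixes, the sketch below walks a dictionary shaped like the parsed repositories.yml. The function name `index_prefixes` and the choice of `name` and `url` as required fields are assumptions made for this example, not part of Nancy Brain's API.

```python
# Hypothetical helper: not part of Nancy Brain. It only mirrors the
# "category/name/..." prefix rule described above.
REQUIRED_FIELDS = ("name", "url")  # assumption: the minimum each entry needs


def index_prefixes(config):
    """Return the search-result path prefix for every repository entry.

    `config` mirrors the parsed repositories.yml: {category: [entry, ...]}.
    """
    prefixes = []
    for category, entries in config.items():
        seen = set()
        for entry in entries:
            missing = [f for f in REQUIRED_FIELDS if not entry.get(f)]
            if missing:
                raise ValueError(f"{category}: entry is missing {missing}")
            if entry["name"] in seen:
                # Names must be unique within a category.
                raise ValueError(f"{category}: duplicate name {entry['name']!r}")
            seen.add(entry["name"])
            prefixes.append(f"{category}/{entry['name']}/")
    return prefixes


config = {
    "ml_frameworks": [
        {"name": "scikit-learn", "url": "https://github.com/scikit-learn/scikit-learn"},
        {"name": "pytorch", "url": "https://github.com/pytorch/pytorch", "ref": "main"},
    ],
    "jupyter_examples": [
        {"name": "data-science-notebooks", "url": "https://github.com/your-org/notebooks",
         "is_notebook": True},
    ],
}

for prefix in index_prefixes(config):
    print(prefix)
```

The same name may appear in two different categories without conflict, because uniqueness is only checked per category.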

Importing Dependencies

If you already have an environment.yml, you can auto-populate repositories.yml with the GitHub source repos for your pip dependencies:

nancy-brain import-env -f environment.yml

Nancy Brain queries PyPI for each pip package, finds its GitHub Source URL, and appends new entries to config/repositories.yml. Packages with no GitHub link, editable installs (-e .), and duplicates are skipped automatically.

The command auto-detects the file format from the filename:

| File | Format detected |
| --- | --- |
| environment.yml / *.yaml | conda environment |
| requirements.txt / *.txt / *.in | pip requirements |
| pyproject.toml | PEP 621 or Poetry |

# requirements.txt
nancy-brain import-env -f requirements.txt

# pyproject.toml (PEP 621 [project].dependencies or Poetry)
nancy-brain import-env -f pyproject.toml

# Pin each entry to the exact installed version (sets ref: in repositories.yml)
nancy-brain import-env -f environment.yml --pin-versions
nancy-brain import-env -f requirements.txt --pin-versions

# Preview what would be added without writing
nancy-brain import-env -f environment.yml --dry-run

# Use a custom category name instead of the derived default
nancy-brain import-env -f environment.yml --category my_project

# Write to a different file
nancy-brain import-env -f environment.yml --output path/to/repositories.yml

--pin-versions only sets ref when the spec has an exact == pin. Packages specified with >=, ~=, or no version are added without a ref.
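The pinning rule can be pictured with a small sketch. The function `exact_pin` and its regular expression are illustrative assumptions, not the parser Nancy Brain uses, and how a version is turned into the final ref value (for example whether a "v" prefix is added) is not specified here.

```python
import re

# Hypothetical pattern: "name==version", tolerating extras such as
# "pkg[extra]==1.0" and a trailing environment marker after ";".
_EXACT_PIN = re.compile(
    r"^\s*([A-Za-z0-9._-]+)(?:\[[^\]]*\])?\s*==\s*([^\s;,=]+)\s*(?:;.*)?$"
)


def exact_pin(spec):
    """Return the pinned version for an exact '==' spec, otherwise None."""
    match = _EXACT_PIN.match(spec)
    if match is None:
        return None  # >=, ~=, compound specs, or no version at all
    version = match.group(2)
    if "*" in version:
        return None  # "==2.*" is a prefix match, not an exact pin
    return version


for spec in ["numpy==1.26.4", "numpy>=1.26", "requests~=2.31", "scipy", "pandas==2.*"]:
    print(f"{spec!r:20} -> {exact_pin(spec)}")
```

Only the first spec in the loop yields a version; the others would be added without a ref under the rule above.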

Tip

After importing, review the generated entries and remove any repos that are too large or irrelevant before running nancy-brain build.

Search Weights (config/weights.yaml)

Control which files get higher priority in search results:

# Base multipliers by file extension
extensions:
  .md: 1.5        # Boost markdown documentation
  .rst: 1.5       # Boost reStructuredText docs  
  .txt: 1.2       # Boost plain text files
  .py: 1.0        # Standard weight for Python code
  .js: 0.9        # Slightly lower for JavaScript
  .json: 0.7      # Lower weight for config files
  .log: 0.3       # Very low for log files

# Additional multipliers for path patterns
path_includes:
  "README": 2.0       # Heavily boost README files
  "tutorial": 1.8     # Boost tutorial content
  "example": 1.5      # Boost example code
  "guide": 1.5        # Boost guide documentation
  "docs/": 1.3        # Boost files in docs directories
  "test": 0.8         # Lower weight for test files
  "__pycache__": 0.1  # Very low for cache files

How it works:

1. Base weight starts at 1.0
2. Extension multiplier is applied
3. Each matching path pattern multiplier is applied
4. Final weight = base × extension × path₁ × path₂ × ...
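The calculation can be sketched in a few lines of Python. This is a minimal illustration under two assumptions: that path patterns are matched as case-sensitive substrings of the file path, and that extensions without an entry default to 1.0. The function name `file_weight` is hypothetical.

```python
import os

# Values copied from the weights.yaml example above.
extensions = {".md": 1.5, ".rst": 1.5, ".txt": 1.2, ".py": 1.0,
              ".js": 0.9, ".json": 0.7, ".log": 0.3}
path_includes = {"README": 2.0, "tutorial": 1.8, "example": 1.5, "guide": 1.5,
                 "docs/": 1.3, "test": 0.8, "__pycache__": 0.1}


def file_weight(path, extensions, path_includes, base=1.0):
    """Multiply the base weight by the extension and every matching pattern."""
    weight = base
    ext = os.path.splitext(path)[1]
    weight *= extensions.get(ext, 1.0)  # assumption: unknown extensions are neutral
    for pattern, multiplier in path_includes.items():
        if pattern in path:  # assumption: plain substring match
            weight *= multiplier
    return weight


print(round(file_weight("docs/README.md", extensions, path_includes), 2))     # 3.9
print(round(file_weight("tests/test_api.py", extensions, path_includes), 2))  # 0.8
```

Because the multipliers compound, a README under docs/ ends up at 1.5 × 2.0 × 1.3 = 3.9. Under substring matching, a short pattern such as "test" would also match paths like "latest/", which is worth keeping in mind when choosing patterns.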

PDF Articles (config/articles.yml)

Optionally include PDF research papers and documentation:

research_papers:
  - name: "attention_is_all_you_need"
    url: "https://arxiv.org/pdf/1706.03762.pdf"
    description: "Transformer architecture paper (Vaswani et al.)"

  - name: "bert_paper"
    url: "https://arxiv.org/pdf/1810.04805.pdf" 
    description: "BERT: Pre-training of Deep Bidirectional Transformers"

documentation:
  - name: "python_tutorial"
    url: "https://docs.python.org/3/tutorial/tutorial.pdf"
    description: "Official Python Tutorial (PDF)"

PDF Processing Requirements:

- Java (for Apache Tika): brew install openjdk
- Fallback libraries: pip install PyPDF2 pdfplumber pymupdf
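Before a build, it can help to check which of these prerequisites are present. The sketch below is a standalone check, not a Nancy Brain command; the function name `pdf_prerequisites` is hypothetical, and the mapping assumes PyMuPDF is importable under its long-standing module name `fitz`.

```python
import importlib.util
import shutil

# pip package name -> import name. PyMuPDF has historically been imported
# as "fitz"; treat this mapping as an assumption for your installed version.
FALLBACK_MODULES = {"PyPDF2": "PyPDF2", "pdfplumber": "pdfplumber", "pymupdf": "fitz"}


def pdf_prerequisites():
    """Report whether Java and each fallback library can be found."""
    status = {"java": shutil.which("java") is not None}
    for package, module in FALLBACK_MODULES.items():
        status[package] = importlib.util.find_spec(module) is not None
    return status


for name, available in pdf_prerequisites().items():
    print(f"{name:12} {'found' if available else 'missing'}")
```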

Environment Variables

Core Settings

| Variable | Default | Description |
| --- | --- | --- |
| USE_DUAL_EMBEDDING | true | Enable separate code and text embedding models |
| CODE_EMBEDDING_MODEL | microsoft/codebert-base | Model for code embeddings |
| KMP_DUPLICATE_LIB_OK | TRUE | Fix OpenMP conflicts on macOS |

PDF Processing

| Variable | Default | Description |
| --- | --- | --- |
| JAVA_HOME | Auto-detected | Java installation path |
| TIKA_SERVER_TIMEOUT | 60 | Tika server timeout (seconds) |
| SKIP_PDF_PROCESSING | false | Skip PDF processing entirely |

Development

| Variable | Default | Description |
| --- | --- | --- |
| NANCY_LOG_LEVEL | INFO | Logging level (DEBUG, INFO, WARNING, ERROR) |
| NANCY_CACHE_SIZE | 1000 | Search result cache size |
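To show how these variables and their documented defaults fit together, here is a minimal sketch of reading them from the environment. The function `load_settings`, the key names in the returned dictionary, and the set of strings accepted as "true" are assumptions for illustration; Nancy Brain's own parsing may differ.

```python
import os

_TRUE_VALUES = {"1", "true", "yes", "on"}  # assumption: accepted truthy spellings


def _as_bool(value):
    return value.strip().lower() in _TRUE_VALUES


def load_settings(env=None):
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "use_dual_embedding": _as_bool(env.get("USE_DUAL_EMBEDDING", "true")),
        "code_embedding_model": env.get("CODE_EMBEDDING_MODEL", "microsoft/codebert-base"),
        "tika_server_timeout": int(env.get("TIKA_SERVER_TIMEOUT", "60")),
        "skip_pdf_processing": _as_bool(env.get("SKIP_PDF_PROCESSING", "false")),
        "log_level": env.get("NANCY_LOG_LEVEL", "INFO").upper(),
        "cache_size": int(env.get("NANCY_CACHE_SIZE", "1000")),
    }


# Passing a plain dict instead of os.environ makes the behavior easy to inspect.
print(load_settings({"SKIP_PDF_PROCESSING": "TRUE", "NANCY_CACHE_SIZE": "2000"}))
```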

Runtime Configuration

Dynamic Weights

Adjust document importance during runtime:

Check knowledge base content:

nancy-brain search "test" --limit 10

Search Filters

Apply filters during search:

# Filter by category
nancy-brain search "machine learning" --category ml_frameworks

# Filter by file type
nancy-brain search "tutorial" --filetype .md

# Combine filters
nancy-brain search "example" --category documentation --filetype .py

Advanced Configuration

Custom Embedding Models

Override the default embedding models:

export USE_DUAL_EMBEDDING=true
export CODE_EMBEDDING_MODEL="microsoft/graphcodebert-base"
export TEXT_EMBEDDING_MODEL="sentence-transformers/all-MiniLM-L6-v2"

Build Options

Customize the knowledge base build process:

# Build with custom paths
nancy-brain build \
  --config custom-repos.yml \
  --embeddings-path custom/embeddings \
  --force-update

# Build specific categories only
nancy-brain build --category ml_frameworks

# Keep raw files for debugging
nancy-brain build --dirty

Performance Tuning

For large knowledge bases:

# config/performance.yml
build:
  batch_size: 100        # Documents per batch
  max_workers: 4         # Parallel processing threads
  chunk_size: 512        # Text chunk size for embedding

search:
  cache_size: 2000       # Larger cache for frequent searches
  timeout: 30            # Search timeout seconds
  max_results: 100       # Maximum results to rank
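The interaction between chunk_size and batch_size can be pictured as follows. This is a generic sketch, not Nancy Brain's build code: the helper names are hypothetical, and it assumes chunk_size counts characters, whereas the real setting may count tokens.

```python
def chunk_text(text, chunk_size=512):
    """Split text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def batches(items, batch_size=100):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


chunks = chunk_text("x" * 1200)            # three chunks: 512 + 512 + 176 characters
documents = [f"doc-{i}" for i in range(250)]
print([len(c) for c in chunks])
print([len(b) for b in batches(documents)])
```

Smaller chunks produce more embeddings per document, and larger batches trade memory for fewer processing rounds, so the two values are usually tuned together.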

Configuration Examples

Academic Research Setup

Suited to indexing research papers alongside their code:

# config/repositories.yml
research_papers:
  - name: "paper-code"
    url: "https://github.com/author/paper-implementation"
    description: "Implementation code for our paper"
    is_notebook: true

reference_implementations:
  - name: "baseline-methods"
    url: "https://github.com/org/baseline-methods"
    description: "Baseline implementation for comparison"
    is_notebook: false

# config/weights.yaml
extensions:
  .md: 2.0      # Heavy boost for documentation
  .ipynb: 1.8   # Boost notebooks
  .py: 1.5      # Boost Python code
  .pdf: 2.5     # Heavy boost for PDFs

path_includes:
  "paper": 3.0      # Heavily boost paper-related content
  "results": 2.0    # Boost results and figures
  "notebook": 1.8   # Boost notebook directories

Software Development Setup

Focused on code documentation and examples:

# config/repositories.yml
frameworks:
  - name: "main-framework"
    url: "https://github.com/your-org/framework"
    description: "Main framework repository"
    is_notebook: false

  - name: "examples"
    url: "https://github.com/your-org/examples"
    description: "Usage examples and tutorials"
    is_notebook: true

dependencies:
  - name: "core-library"
    url: "https://github.com/org/core-lib"
    description: "Core dependency library"
    is_notebook: false

# config/weights.yaml
extensions:
  .py: 1.5      # Boost Python code
  .md: 1.8      # Boost documentation
  .yaml: 1.2    # Boost config files
  .json: 0.8    # Lower weight for data files

path_includes:
  "README": 3.0     # Critical for understanding
  "docs/": 2.0      # Documentation directory
  "examples/": 1.8  # Example code
  "api/": 1.5       # API documentation
  "test": 0.7       # Lower weight for tests

Troubleshooting Configuration

Validation

Check your configuration files:

# Validate repository config
nancy-brain validate-config config/repositories.yml

# Test search weights
nancy-brain test-weights config/weights.yaml

# Check PDF article URLs
nancy-brain validate-articles config/articles.yml

Common Issues

Repository clone failures:

# Test repository access
git ls-remote https://github.com/org/repo.git

# Check SSH key setup for private repos
ssh -T git@github.com

PDF download failures:

# Test PDF URLs manually
curl -I https://arxiv.org/pdf/1706.03762.pdf

# Check Java installation for Tika
java -version
echo $JAVA_HOME

Search weights not taking effect:

# Check weight calculation
nancy-brain explain-weights "path/to/document.md"

# Test specific patterns
nancy-brain test-pattern "README" --weights config/weights.yaml

Next Steps