GitHub Copilot and Large Codebases
Working effectively across a large codebase requires reliable context, predictable patterns, and disciplined decomposition. This chapter explains how Copilot uses code context (local and remote indexing), how to augment that context with instructions and prompt files, and practical chunking strategies that scale.
Local indexing
Modern IDEs index the workspace to improve symbol search, navigation, and context available to Copilot. In addition, language servers (Language Server Protocol, LSP) expose structure—types, signatures, references—that helps Copilot generate syntactically valid and idiomatic code. Local indexing is particularly valuable when working offline, on private repositories, or when rapidly iterating in a branch.
Research continues on code representations optimised for large language models (LLMs), aiming to preserve relationships between files, symbols, and architectural boundaries so that changes remain coherent across a codebase.
Remote indexing
For repositories hosted on GitHub.com, Copilot can leverage repository indexes maintained by GitHub to enrich context. This avoids expensive local scans for very large repositories and improves retrieval of related files during suggestions. Remote indexing is complementary to local indexing; together they provide faster, more relevant context without manual curation.
Augmenting context with instructions and prompts
Before instruction files and prompt files were available, teams often built up chat context step by step to guide the model. With repository-scoped guidance, prompts can be shorter and more reliable.
Recommended approach:
- Add .github/copilot-instructions.md to define coding standards, documentation locations, testing conventions, and terminology
- Reference high-signal docs (for example, docs/architecture.md, docs/coding-standards.md) so Copilot can consult them without re-prompting
- Create prompt files for repeatable workflows (where supported), such as test expansion, refactor plans, or lint-and-fix routines
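A minimal sketch of what such an instruction file might contain; the referenced documents and the Customer API path come from this chapter's examples, while the specific conventions listed are illustrative and should be replaced with your own:

```markdown
# Copilot instructions

## Key documents
- Architecture overview: docs/architecture.md
- Coding standards: docs/coding-standards.md

## Testing conventions
- Unit tests accompany the code they cover; mock external dependencies
- Match the existing naming conventions when adding tests

## Terminology
- "Customer API" refers to the REST service under /customer-domain/api/
```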
Example progressive prompting (now simplified via instructions):
- Create a high-level architecture overview at docs/architecture_overview.md
- Generate a domain-specific architecture note, for example docs/customer_api_architecture.md
- Propose a plan to increase test coverage for the Customer API
- Propose a plan to increase test coverage for the Customer API
- Implement additional unit tests; mock external dependencies
- Refine tests to align with naming conventions and coding standards
With instructions and prompt files in place, the prompts can be shorter, for example:
- "Create a plan to increase the test coverage for the Customer API"
- "Implement these additional unit tests for the Customer API"
Where supported, prompt files (for example, .github/prompts/improve-test-coverage.prompt.md) can encapsulate multi-step guidance and point to the instruction file and relevant documents.
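A sketch of what such a prompt file might contain, reusing the workflow above; the exact structure is illustrative, so adapt it to your tooling's prompt-file conventions:

```markdown
# Improve test coverage

Follow .github/copilot-instructions.md and the testing conventions in
docs/coding-standards.md.

1. Propose a plan to increase test coverage for the selected code.
2. Implement the additional unit tests, mocking external dependencies.
3. Refine the tests to align with naming conventions and coding standards.
4. Summarise any code that remains uncovered and why.
```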
Chunking strategies
Effective chunking balances sufficient context for the LLM with human reviewability and token limits.
Strategy 1: Domain-driven chunking
Approach: Divide the codebase by business domains or functional areas rather than technical layers.
Implementation:
- Group related features, entities, and business logic together
- Chunk by customer-facing features (for example, authentication, billing, user management)
- Maintain domain boundaries to preserve business context
Benefits:
- Preserves business logic relationships
- Reduces cross-domain dependencies in prompts
- Enables domain experts to guide AI assistance effectively
Example structure:
/customer-domain/
├── api/
├── services/
├── models/
└── tests/
/billing-domain/
├── api/
├── services/
├── models/
└── tests/
Strategy 2: Architectural layer chunking
Approach: Separate code by architectural concerns (presentation, business logic, data access).
Implementation:
- Process one architectural layer at a time
- Maintain interface contracts between layers
- Focus AI attention on layer-specific patterns and conventions
Benefits:
- Clear separation of concerns for AI processing
- Consistent patterns within each layer
- Easier to validate architectural compliance
Example workflow:
- Transform data access layer (repositories, DAOs)
- Update business logic layer (services, domain models)
- Modify presentation layer (controllers, views)
- Update cross-cutting concerns (logging, security)
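Maintaining interface contracts between layers is easier when the contract is explicit in code. A minimal Python sketch of such a contract (the Customer names are illustrative): the business layer depends only on the repository protocol, so data access and business logic can be chunked and transformed independently.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Customer:
    id: str
    email: str


class CustomerRepository(Protocol):
    """Contract the business layer relies on; data access chunks must keep it stable."""

    def get(self, customer_id: str) -> Customer | None: ...
    def save(self, customer: Customer) -> None: ...


class CustomerService:
    """Business logic layer: depends on the contract, not on any concrete repository."""

    def __init__(self, repository: CustomerRepository) -> None:
        self._repository = repository

    def update_email(self, customer_id: str, email: str) -> Customer:
        customer = self._repository.get(customer_id)
        if customer is None:
            raise LookupError(f"Unknown customer: {customer_id}")
        customer.email = email
        self._repository.save(customer)
        return customer
```

As long as the protocol stays untouched, Copilot can be pointed at one layer at a time without risking the other.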
Strategy 3: Dependency-aware chunking
Approach: Chunk code based on dependency relationships to minimise coupling issues.
Implementation:
- Start with leaf nodes (no dependencies)
- Work upwards through the dependency tree
- Use dependency analysis tools to identify optimal chunk boundaries
Benefits:
- Reduces compilation and runtime errors
- Maintains system stability during transformation
- Enables incremental testing and validation
Copilot integration:
- Use dependency graphs referenced from .github/copilot-instructions.md
- Include dependency documentation for context
- Prompt for dependency impact analysis
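The leaf-first ordering described in the implementation steps above can be derived mechanically from a dependency graph. A minimal sketch using Python's standard graphlib; the module names are illustrative, and in practice the graph would come from a dependency analysis tool:

```python
from graphlib import TopologicalSorter

# Illustrative module-level dependency graph: module -> modules it depends on.
dependencies = {
    "api.customers": {"services.customers"},
    "services.customers": {"repositories.customers", "models.customer"},
    "repositories.customers": {"models.customer"},
    "models.customer": set(),
}

# static_order() emits dependencies before dependants, i.e. leaf nodes first,
# which gives a safe order for transforming chunks from the bottom up.
transformation_order = list(TopologicalSorter(dependencies).static_order())
print(transformation_order)
# ['models.customer', 'repositories.customers', 'services.customers', 'api.customers']
```

Each name in the resulting order can become a chunk, processed only after everything it depends on has already been transformed.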
Strategy 4: File size and complexity chunking
Approach: Divide based on file size, cyclomatic complexity, or lines of code.
Implementation:
- Target 200–500 lines of code per chunk for practical AI processing
- Break down complex files before transformation
- Group simple, related files together
Benefits:
- Stays within practical token limits
- Reduces cognitive load for review
- Enables focused AI suggestions
Practical guidelines:
- Large files (>1000 LOC): break into smaller modules first
- Medium files (200–1000 LOC): process individually
- Small files (<200 LOC): group by relationship
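The size guidelines above can be applied mechanically when planning chunks. A minimal sketch that buckets source files by line count; the thresholds mirror the guidelines, while the src root and *.py pattern are assumptions about the project layout:

```python
from pathlib import Path

LARGE = 1000   # break into smaller modules first
SMALL = 200    # group with related small files


def bucket_files(root: str = "src", pattern: str = "*.py") -> dict[str, list[tuple[Path, int]]]:
    """Sort source files into chunk-planning buckets by line count."""
    buckets: dict[str, list[tuple[Path, int]]] = {"split_first": [], "individual": [], "group": []}
    for path in sorted(Path(root).rglob(pattern)):
        loc = len(path.read_text(encoding="utf-8", errors="ignore").splitlines())
        if loc > LARGE:
            buckets["split_first"].append((path, loc))
        elif loc >= SMALL:
            buckets["individual"].append((path, loc))
        else:
            buckets["group"].append((path, loc))
    return buckets


if __name__ == "__main__":
    for name, files in bucket_files().items():
        print(name, [(str(p), loc) for p, loc in files])
```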
Strategy 5: Test-driven chunking
Approach: Organise chunks around testable units and existing test boundaries.
Implementation:
- Use existing test suites to define chunk boundaries
- Ensure each chunk includes its corresponding tests
- Maintain test coverage throughout transformation
Benefits:
- Preserves validation capability
- Enables continuous verification
- Maintains code behaviour contracts
Copilot workflow:
- Include existing tests in chunk context
- Generate new tests alongside code changes
- Validate transformed code against test suites
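One way to derive chunk boundaries from an existing suite is to pair each test file with the module it covers, so every chunk carries its own validation. A minimal sketch, assuming a test_<module>.py naming convention and src/tests directories; these are assumptions about the project layout, not a Copilot requirement:

```python
from pathlib import Path


def chunks_from_tests(src_root: str = "src", test_root: str = "tests") -> list[dict[str, Path]]:
    """Pair each test file with the source module it covers; each pair is one chunk."""
    chunks = []
    for test_file in sorted(Path(test_root).rglob("test_*.py")):
        module_name = test_file.name.removeprefix("test_")  # e.g. test_billing.py -> billing.py
        matches = sorted(Path(src_root).rglob(module_name))
        if matches:
            chunks.append({"source": matches[0], "tests": test_file})
    return chunks


for chunk in chunks_from_tests():
    print(f"Transform {chunk['source']} together with {chunk['tests']}")
```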
Strategy 6: API and interface chunking
Approach: Chunk around stable API boundaries and public interfaces.
Implementation:
- Group code by public API endpoints
- Maintain interface contracts during transformation
- Process internal implementation separately from public interfaces
Benefits:
- Preserves external contracts
- Enables incremental deployment
- Reduces breaking changes
Example for REST APIs:
- Chunk by endpoint groups (/api/users/*, /api/orders/*)
- Include route definitions, handlers, and related business logic
- Transform supporting services in separate chunks
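Grouping by endpoint prefix can be done directly from a route table. A minimal sketch; the routes shown are illustrative, and in practice they could be read from the framework's route registry or the OpenAPI specification:

```python
from collections import defaultdict

# Illustrative route table for the endpoint groups mentioned above.
routes = [
    "/api/users/{id}",
    "/api/users/{id}/preferences",
    "/api/orders/{id}",
    "/api/orders/{id}/items",
]

chunks: dict[str, list[str]] = defaultdict(list)
for route in routes:
    prefix = "/".join(route.split("/")[:3])  # "/api/users", "/api/orders", ...
    chunks[prefix].append(route)

for prefix, grouped in chunks.items():
    print(prefix, grouped)
```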
Strategy 7: Timeline-based chunking
Approach: Divide work by development phases or sprint boundaries.
Implementation:
- Align chunks with delivery milestones
- Prioritise high-value or high-risk areas first
- Enable parallel development streams
Benefits:
- Maintains development velocity
- Enables risk management
- Supports agile delivery practices
Copilot planning:
- Create phase-specific instruction files
- Use milestone-based prompt libraries
- Track transformation progress across phases
Chunking decision matrix
When selecting a chunking strategy, consider:
| Factor | Domain | Layer | Dependency | Size | Test | API | Timeline |
|---|---|---|---|---|---|---|---|
| Business logic preservation | ⭐⭐⭐ | ⭐⭐ | ⭐ | ⭐ | ⭐⭐ | ⭐⭐⭐ | ⭐⭐ |
| Technical complexity | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ⭐⭐ | ⭐⭐ | ⭐ |
| Team coordination | ⭐⭐⭐ | ⭐⭐ | ⭐ | ⭐ | ⭐⭐ | ⭐⭐ | ⭐⭐⭐ |
| AI context quality | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ |
| Risk management | ⭐⭐ | ⭐⭐ | ⭐⭐⭐ | ⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ |
Combining strategies
In practice, successful transformations often combine multiple strategies:
- Phase 1: use dependency-aware chunking to identify transformation order
- Phase 2: apply domain-driven chunking for business logic areas
- Phase 3: use API chunking for public interfaces
- Phase 4: apply file-size chunking for remaining components
Copilot-specific considerations
Token window optimisation:
- Include only essential context files per chunk
- Use .github/copilot-instructions.md to provide background context
- Reference documentation rather than including full specifications
Prompt efficiency:
- Create chunk-specific prompt templates or prompt files
- Maintain chunk documentation for consistent AI context
Quality assurance:
- Include validation criteria in chunk definitions
- Use automated tests to verify behaviour at chunk boundaries
- Document chunk relationships for human reviewers
Documentation strategies
Documentation is a multiplier for Copilot effectiveness. Prioritise high-signal, low-noise artefacts that the model can reference consistently:
- Architecture overview: docs/architecture.md with system context, bounded contexts, and major data flows
- Module READMEs: local purpose, key entry points, dependencies, and testing instructions
- API contracts: OpenAPI/Swagger, GraphQL schemas, protobuf/IDLs kept alongside implementations
- Coding standards: language- and framework-specific conventions; naming, error handling, logging
- Testing conventions: unit/integration/contract strategies; fixtures, mocks, and environment guidance
- Dependency documentation: diagrams or generated graphs, version policies, deprecation timelines
- Decision records: lightweight ADRs for significant architectural choices
Practices:
- Co-locate docs with code; link from .github/copilot-instructions.md
- Keep documents concise; prefer links and references over duplication
- Automate generation where possible (for example, API specs, dependency graphs)
- Treat documentation updates as part of the definition of done
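Dependency documentation, for example, can be generated rather than hand-maintained, so the file referenced from .github/copilot-instructions.md never drifts from the code. A minimal sketch using Python's standard ast module to list each file's top-level imports; the src root and the Markdown output format are illustrative:

```python
import ast
from pathlib import Path


def module_imports(root: str = "src") -> dict[str, set[str]]:
    """Map each Python file to the top-level packages it imports."""
    graph: dict[str, set[str]] = {}
    for path in sorted(Path(root).rglob("*.py")):
        tree = ast.parse(path.read_text(encoding="utf-8"))
        imports: set[str] = set()
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                imports.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                imports.add(node.module.split(".")[0])
        graph[str(path)] = imports
    return graph


# Emit a Markdown listing that a dependency document can include verbatim.
for module, deps in module_imports().items():
    print(f"- `{module}`: {', '.join(sorted(deps)) or 'no imports'}")
```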
Key Takeaways
- Indexing (local and remote) improves suggestion relevance; pair it with high-signal documentation.
- Instruction and prompt files reduce prompt length and increase consistency across teams.
- Chunking is essential: choose boundaries that protect behaviour and fit context limits.
- Keep changes small, testable, and reviewable; encode validation into your workflow.
- Documentation is part of the system: make it discoverable, current, and referenced by Copilot.