Chunker Module

The chunker module provides the main entry points for parsing source code into CodeChunk objects.

Main Functions

chunker.chunk_file(path, language, extract_metadata=True, include_retrieval_metadata=False)

Parse the file and return a list of CodeChunk objects.

Parameters:
  • path (str | Path) – Path to the file to chunk

  • language (str) – Programming language

  • extract_metadata (bool) – Whether to extract metadata (default: True)

  • include_retrieval_metadata (bool) – Whether to add retrieval-oriented metadata (default: False)

Return type:

list[CodeChunk]

Returns:

List of CodeChunk objects with optional metadata

chunker.chunk_text_with_token_limit(text, language, max_tokens, file_path='', model='gpt-4', extract_metadata=True, include_retrieval_metadata=False)

Parse text and return chunks that respect token limits.

This function chunks code using tree-sitter and ensures no chunk exceeds the specified token limit. Large chunks are automatically split while preserving code structure when possible.

Parameters:
  • text (str) – Source code text to chunk

  • language (str) – Programming language

  • max_tokens (int) – Maximum tokens per chunk

  • file_path (str) – Path to the file (optional; default: '')

  • model (str) – Tokenizer model to use (default: 'gpt-4')

  • extract_metadata (bool) – Whether to extract metadata (default: True)

  • include_retrieval_metadata (bool) – Whether to add retrieval-oriented metadata (default: False)

Return type:

list[CodeChunk]

Returns:

List of CodeChunk objects with token counts in metadata
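The splitting behaviour described above can be pictured with a small, library-independent sketch. It does not call chunker or tree-sitter: count_tokens is a hypothetical whitespace tokenizer standing in for the model tokenizer, and split_by_token_limit is an illustrative helper that packs whole lines into pieces and only cuts inside a line when that line alone exceeds the limit. The real function may split along syntax-tree boundaries instead.

```python
def count_tokens(text):
    """Hypothetical stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


def split_by_token_limit(text, max_tokens):
    """Pack whole lines into pieces of at most max_tokens tokens each."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    pieces = []
    current, used = [], 0
    for line in text.splitlines():
        words = line.split()
        if len(words) > max_tokens:
            # Oversized line: flush pending lines, then cut the line by words.
            if current:
                pieces.append("\n".join(current))
                current, used = [], 0
            for i in range(0, len(words), max_tokens):
                pieces.append(" ".join(words[i:i + max_tokens]))
            continue
        if current and used + len(words) > max_tokens:
            # Adding this line would exceed the budget, so start a new piece.
            pieces.append("\n".join(current))
            current, used = [], 0
        current.append(line)
        used += len(words)
    if current:
        pieces.append("\n".join(current))
    return pieces
```

Keeping lines whole wherever possible is the simplest approximation of "preserving code structure"; a structure-aware splitter would prefer statement or block boundaries.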

chunker.chunk_file_streaming(path, language, include_retrieval_metadata=False)

Stream chunks from a file without loading everything into memory.

Return type:

Iterator[CodeChunk]
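The practical consequence of returning an iterator is that chunks are produced one at a time and the caller can stop early. The sketch below illustrates that pattern without using the chunker package: iter_blocks is a hypothetical generator that reads a file line by line and yields blank-line-separated blocks, which is only a stand-in for real syntax-aware chunking.

```python
from collections.abc import Iterator


def iter_blocks(path) -> Iterator[str]:
    """Yield blank-line-separated blocks, reading one line at a time."""
    block = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                block.append(line.rstrip("\n"))
            elif block:
                # A blank line closes the current block.
                yield "\n".join(block)
                block = []
    if block:
        yield "\n".join(block)
```

Only the current block is held in memory, so the peak footprint depends on the largest block rather than on the file size.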

chunker.chunk_directory(directory, language, extensions=None, num_workers=None, use_cache=True, use_streaming=False)

Convenience function to process a directory in parallel.

Return type:

dict[Path, list[CodeChunk]]
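The shape of the return value, a dict keyed by file path, can be illustrated with a self-contained sketch of parallel directory processing. Everything here is an assumption made for illustration: process_directory and split_on_blank_lines are hypothetical names, a thread pool is used for simplicity, and the search is recursive. The library may use processes, caching, or a different traversal.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def split_on_blank_lines(path):
    """Hypothetical per-file worker: return blank-line-separated blocks."""
    text = path.read_text(encoding="utf-8")
    return [block.strip("\n") for block in text.split("\n\n") if block.strip()]


def process_directory(directory, extensions=(".py",), num_workers=None):
    """Map every matching file under directory to its list of blocks."""
    files = sorted(
        p for p in Path(directory).rglob("*")
        if p.is_file() and p.suffix in extensions
    )
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map preserves input order, so results line up with files.
        results = list(pool.map(split_on_blank_lines, files))
    return dict(zip(files, results))
```

Because the result is a dict, callers iterate with .items() to get each path together with its chunks.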

Classes

class chunker.CodeChunk(language, file_path, node_type, start_line, end_line, byte_start, byte_end, parent_context, content, chunk_id='', parent_chunk_id=None, references=<factory>, dependencies=<factory>, metadata=<factory>, node_id='', file_id='', symbol_id=None, parent_route=<factory>, qualified_route=<factory>, definition_id='')

Bases: object

language: str
file_path: str
node_type: str
start_line: int
end_line: int
byte_start: int
byte_end: int
parent_context: str
content: str
chunk_id: str = ''
parent_chunk_id: str | None = None
references: list[str]
dependencies: list[str]
metadata: dict[str, Any]
node_id: str = ''
file_id: str = ''
symbol_id: str | None = None
parent_route: list[str]
qualified_route: list[str]
definition_id: str = ''
generate_id()

Generate a stable ID from a hash of the file, language, route, and text.

Return type:

str
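A content-derived ID of this kind can be sketched with the standard library. The exact fields, separator, hash algorithm, and ID length used by generate_id are not documented here, so stable_chunk_id below is a hypothetical helper that only shows the general idea: identical inputs always hash to the same ID.

```python
import hashlib


def stable_chunk_id(file_path, language, route, text):
    """Hash file path, language, route, and text into a short hex ID."""
    # A NUL separator keeps ("ab", "c") distinct from ("a", "bc").
    payload = "\x00".join([file_path, language, "/".join(route), text])
    # Assumption: SHA-256 truncated to 16 hex characters.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

Since the ID depends only on its inputs, re-chunking an unchanged file yields the same IDs, while any edit to the chunk text produces a new one.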

__init__(language, file_path, node_type, start_line, end_line, byte_start, byte_end, parent_context, content, chunk_id='', parent_chunk_id=None, references=<factory>, dependencies=<factory>, metadata=<factory>, node_id='', file_id='', symbol_id=None, parent_route=<factory>, qualified_route=<factory>, definition_id='')
class chunker.ChunkerConfig(config_path=None, use_env_vars=True)

Bases: object

Configuration manager for the chunker system.

Supports environment variable expansion and overrides:

  • ${VAR} or ${VAR:default} syntax in config files

  • CHUNKER_* environment variables override config values
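The ${VAR} and ${VAR:default} syntax can be illustrated with a minimal sketch built on the same regular expression as the ENV_VAR_PATTERN class attribute. expand_env_vars is a hypothetical helper, not the library's implementation, and leaving an unresolved placeholder untouched when no default is given is an assumption.

```python
import os
import re

# Same pattern as ChunkerConfig.ENV_VAR_PATTERN, written as a raw string.
ENV_VAR_PATTERN = re.compile(r"\$\{([^}]+)\}")


def expand_env_vars(value, env=None):
    """Replace ${VAR} and ${VAR:default} with values taken from env."""
    env = os.environ if env is None else env

    def substitute(match):
        name, sep, default = match.group(1).partition(":")
        if name in env:
            return env[name]
        # Assumption: with no default, the placeholder is left as written.
        return default if sep else match.group(0)

    return ENV_VAR_PATTERN.sub(substitute, value)
```

For instance, expand_env_vars("${CACHE_DIR:/tmp}/chunks", {}) falls back to the default and gives "/tmp/chunks".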

DEFAULT_CONFIG_FILENAME = 'chunker.config'
SUPPORTED_FORMATS: ClassVar[set[str]] = {'.json', '.toml', '.yaml', '.yml'}
ENV_PREFIX = 'CHUNKER_'
ENV_VAR_PATTERN = re.compile('\\$\\{([^}]+)\\}')
__init__(config_path=None, use_env_vars=True)
classmethod find_config(start_path=None)

Find configuration file starting from the given path.

Return type:

Path | None
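One plausible reading of "starting from the given path" is an upward search through parent directories. The sketch below shows that strategy under loudly stated assumptions: find_config_upwards is a hypothetical helper, the file is assumed to be named chunker.config plus one of the supported extensions, and the first match wins. The library's actual search order may differ.

```python
from pathlib import Path

DEFAULT_CONFIG_FILENAME = "chunker.config"
SUPPORTED_FORMATS = (".toml", ".yaml", ".yml", ".json")


def find_config_upwards(start_path=None):
    """Return the first config file found in start_path or any parent."""
    start = Path(start_path) if start_path is not None else Path.cwd()
    start = start.resolve()
    if start.is_file():
        start = start.parent
    for folder in (start, *start.parents):
        for suffix in SUPPORTED_FORMATS:
            # Assumption: the file is named e.g. chunker.config.toml.
            candidate = folder / (DEFAULT_CONFIG_FILENAME + suffix)
            if candidate.is_file():
                return candidate
    return None
```

Returning None when nothing is found matches the documented return type of Path | None.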

load(config_path)

Load configuration from file.

Return type:

None

save(config_path=None)

Save configuration to file.

Note: TOML output requires the optional tomli-w package. Install it with: pip install tomli-w

Return type:

None

get_plugin_config(language)

Get configuration for a specific language plugin.

Return type:

PluginConfig

set_plugin_config(language, config)

Set configuration for a specific language plugin.

Return type:

None

add_plugin_directory(directory)

Add a plugin directory.

Return type:

None

remove_plugin_directory(directory)

Remove a plugin directory.

Return type:

None

classmethod create_example_config(config_path)

Create an example configuration file.

Return type:

None

classmethod get_env_var_info()

Get information about supported environment variables.

Return type:

dict[str, str]

Examples

Basic usage:

from chunker import chunk_file

# Chunk a Python file
chunks = chunk_file("example.py", language="python")

for chunk in chunks:
    print(f"{chunk.node_type}: {chunk.start_line}-{chunk.end_line}")

Streaming:

from chunker import chunk_file_streaming

for chunk in chunk_file_streaming("large_file.py", language="python"):
    print(f"{chunk.node_type}: {chunk.content[:50]}...")

Parallel processing:

from chunker import chunk_directory

results = chunk_directory("src/", language="python", num_workers=4)

# chunk_directory returns a dict mapping each file path to its chunks
for path, chunks in results.items():
    print(f"{path}: {len(chunks)} chunks")