Chunker Module

This module provides the main chunking functionality.

Main Functions

chunker.chunk_file(path, language, extract_metadata=True, include_retrieval_metadata=False)

Parse the file and return a list of CodeChunk objects.

Parameters:

path (str | Path) – Path to the file to chunk
language (str) – Programming language
extract_metadata (bool) – Whether to extract metadata (default: True)
include_retrieval_metadata (bool) – Whether to add retrieval-oriented metadata (default: False)

Return type:

list[CodeChunk]

Returns:

List of CodeChunk objects with optional metadata

chunker.chunk_text_with_token_limit(text, language, max_tokens, file_path='', model='gpt-4', extract_metadata=True, include_retrieval_metadata=False)

Parse text and return chunks that respect token limits.

This function chunks code using tree-sitter and ensures no chunk exceeds the specified token limit. Large chunks are automatically split while preserving code structure when possible.

Parameters:

text (str) – Source code text to chunk
language (str) – Programming language
max_tokens (int) – Maximum tokens per chunk
file_path (str) – Path to the file (optional; default: '')
model (str) – Tokenizer model to use (default: 'gpt-4')
extract_metadata (bool) – Whether to extract metadata (default: True)
include_retrieval_metadata (bool) – Whether to add retrieval-oriented metadata (default: False)

Return type:

list[CodeChunk]

Returns:

List of CodeChunk objects with token counts in metadata
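
The splitting behaviour described above can be pictured with a small, library-independent sketch. It is not the chunker's implementation: `count_tokens` is a hypothetical whitespace-based stand-in for a real model tokenizer, and `split_by_token_limit` is an illustrative helper that packs whole lines greedily, so a piece only exceeds `max_tokens` when a single line is over the limit on its own.

```python
def count_tokens(text: str) -> int:
    """Hypothetical stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


def split_by_token_limit(text: str, max_tokens: int) -> list[str]:
    """Greedily pack whole lines into pieces of at most max_tokens tokens.

    A single line that exceeds the limit on its own is emitted as its own
    piece rather than being cut mid-line.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")

    pieces: list[str] = []
    current: list[str] = []
    current_tokens = 0

    for line in text.splitlines():
        line_tokens = count_tokens(line)
        # Start a new piece when adding this line would exceed the limit.
        if current and current_tokens + line_tokens > max_tokens:
            pieces.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += line_tokens

    if current:
        pieces.append("\n".join(current))
    return pieces


source = "def a():\n    return 1\n\ndef b():\n    return 2\n"
pieces = split_by_token_limit(source, max_tokens=4)
```

Splitting on line boundaries is only one possible reading of "preserving code structure when possible"; a tree-sitter based splitter would more plausibly cut at syntax-node boundaries.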

chunker.chunk_file_streaming(path, language, include_retrieval_metadata=False)

Stream chunks from a file without loading everything into memory.

Return type: Iterator[CodeChunk]
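
Because the return value is an iterator, a caller can stop early without producing the remaining items. The sketch below illustrates that property with a hypothetical stand-in generator, `fake_chunk_stream` (not part of the library), and the standard `itertools.islice`.

```python
from collections.abc import Iterator
from itertools import islice

produced = []  # records which items the generator actually produced


def fake_chunk_stream(n: int) -> Iterator[str]:
    """Hypothetical stand-in for a streaming chunker: yields items lazily."""
    for i in range(n):
        produced.append(i)
        yield f"chunk-{i}"


# Take only the first three items; the rest are never generated.
first_three = list(islice(fake_chunk_stream(1_000_000), 3))
```

The same pattern should apply to any iterator of chunks, for example when previewing the first few chunks of a very large file.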

chunker.chunk_directory(directory, language, extensions=None, num_workers=None, use_cache=True, use_streaming=False)

Convenience function to process a directory in parallel.

Return type: dict[Path, list[CodeChunk]]

Classes

class chunker.CodeChunk(language, file_path, node_type, start_line, end_line, byte_start, byte_end, parent_context, content, chunk_id='', parent_chunk_id=None, references=<factory>, dependencies=<factory>, metadata=<factory>, node_id='', file_id='', symbol_id=None, parent_route=<factory>, qualified_route=<factory>, definition_id='')

Bases: object

language: str

file_path: str

node_type: str

start_line: int

end_line: int

byte_start: int

byte_end: int

parent_context: str

content: str

chunk_id: str = ''

parent_chunk_id: str | None = None

references: list[str]

dependencies: list[str]

metadata: dict[str, Any]

node_id: str = ''

file_id: str = ''

symbol_id: str | None = None

parent_route: list[str]

qualified_route: list[str]

definition_id: str = ''

generate_id()

Generate a stable ID using file/language/route/text hash.

Return type: str

__init__(language, file_path, node_type, start_line, end_line, byte_start, byte_end, parent_context, content, chunk_id='', parent_chunk_id=None, references=<factory>, dependencies=<factory>, metadata=<factory>, node_id='', file_id='', symbol_id=None, parent_route=<factory>, qualified_route=<factory>, definition_id='')
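
The `<factory>` defaults in the signature are how a dataclass-generated `__init__` renders fields declared with `default_factory`, which gives each instance its own list or dict. The sketch below uses a reduced, hypothetical stand-in class (`MiniChunk`, not the real `CodeChunk`) to show that pattern together with one way a stable ID could be derived from file, language, route, and text. The choice of SHA-256, the separator, and the 16-character truncation are assumptions, not the library's actual scheme.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class MiniChunk:
    """Reduced, hypothetical stand-in for CodeChunk (illustration only)."""

    language: str
    file_path: str
    content: str
    # default_factory gives every instance its own fresh list.
    qualified_route: list[str] = field(default_factory=list)

    def generate_id(self) -> str:
        """Hash file/language/route/text into a short, deterministic ID."""
        parts = [self.file_path, self.language,
                 "/".join(self.qualified_route), self.content]
        digest = hashlib.sha256("\x00".join(parts).encode("utf-8"))
        return digest.hexdigest()[:16]  # truncation length is an assumption


a = MiniChunk("python", "example.py", "def f(): pass", ["module", "f"])
b = MiniChunk("python", "example.py", "def f(): pass", ["module", "f"])
c = MiniChunk("python", "example.py", "def g(): pass", ["module", "g"])
```

Here `a` and `b` produce the same ID because every hashed input matches, while `c` differs in both route and text and so hashes to a different value.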

class chunker.ChunkerConfig(config_path=None, use_env_vars=True)

Bases: object

Configuration manager for the chunker system.

Supports environment variable expansion and overrides:

- ${VAR} or ${VAR:default} syntax in config files
- CHUNKER_* environment variables override config values

DEFAULT_CONFIG_FILENAME = 'chunker.config'

SUPPORTED_FORMATS: ClassVar[set[str]] = {'.json', '.toml', '.yaml', '.yml'}

ENV_PREFIX = 'CHUNKER_'

ENV_VAR_PATTERN = re.compile('\\$\\{([^}]+)\\}')

__init__(config_path=None, use_env_vars=True)

classmethod find_config(start_path=None)

Find configuration file starting from the given path.

Return type: Path | None
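
The reference does not spell out the search strategy. One common approach, sketched here purely as an assumption, is to look in the starting directory and then in each parent for the default filename combined with a supported extension. `find_config_upwards` is a hypothetical helper, and both the upward walk and the `chunker.config` + suffix naming are guesses rather than documented behaviour.

```python
from __future__ import annotations

from pathlib import Path

# Values copied from the ChunkerConfig attributes listed in this reference.
DEFAULT_CONFIG_FILENAME = "chunker.config"
SUPPORTED_FORMATS = {".json", ".toml", ".yaml", ".yml"}


def find_config_upwards(start_path: Path) -> Path | None:
    """Hypothetical search: check start_path, then each parent directory."""
    start = start_path.resolve()
    for directory in [start, *start.parents]:
        # Sort so the result is deterministic when several formats exist.
        for suffix in sorted(SUPPORTED_FORMATS):
            candidate = directory / f"{DEFAULT_CONFIG_FILENAME}{suffix}"
            if candidate.is_file():
                return candidate
    return None


found = find_config_upwards(Path.cwd())  # a Path, or None if nothing matches
```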

load(config_path)

Load configuration from file.

Return type: None

save(config_path=None)

Save configuration to file.

Note: TOML output requires the optional 'tomli-w' package. Install it with: pip install tomli-w

Return type: None

get_plugin_config(language)

Get configuration for a specific language plugin.

Return type: PluginConfig

set_plugin_config(language, config)

Set configuration for a specific language plugin.

Return type: None

add_plugin_directory(directory)

Add a plugin directory.

Return type: None

remove_plugin_directory(directory)

Remove a plugin directory.

Return type: None

classmethod create_example_config(config_path)

Create an example configuration file.

Return type: None

classmethod get_env_var_info()

Get information about supported environment variables.

Return type: dict[str, str]

Examples

Basic usage:

from chunker import chunk_file

# Chunk a Python file
chunks = chunk_file("example.py", language="python")

for chunk in chunks:
    print(f"{chunk.node_type}: {chunk.start_line}-{chunk.end_line}")

Streaming:

from chunker import chunk_file_streaming

for chunk in chunk_file_streaming("large_file.py", language="python"):
    print(f"{chunk.node_type}: {chunk.content[:50]}...")

Parallel processing:

from chunker import chunk_directory

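# num_workers matches the chunk_directory signature documented above
results = chunk_directory("src/", language="python", num_workers=4)

# The result is a dict mapping each file path to its list of chunks
for path, chunks in results.items():
    print(f"{path}: {len(chunks)} chunks")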
