Chunker Module
The main chunking functionality.
Main Functions
- chunker.chunk_file(path, language, extract_metadata=True, include_retrieval_metadata=False)
Parse the file and return a list of CodeChunk.
- Parameters:
- Return type:
list[CodeChunk]
- Returns:
List of CodeChunk objects with optional metadata
- chunker.chunk_text_with_token_limit(text, language, max_tokens, file_path='', model='gpt-4', extract_metadata=True, include_retrieval_metadata=False)
Parse text and return chunks that respect token limits.
This function chunks code using tree-sitter and ensures no chunk exceeds the specified token limit. Large chunks are automatically split while preserving code structure when possible.
- Parameters:
text (str) – Source code text to chunk
language (str) – Programming language
max_tokens (int) – Maximum tokens per chunk
file_path (str) – Path to the file (optional)
model (str) – Tokenizer model to use (default: "gpt-4")
extract_metadata (bool) – Whether to extract metadata (default: True)
include_retrieval_metadata (bool) – Whether to add retrieval-oriented metadata
- Return type:
list[CodeChunk]
- Returns:
List of CodeChunk objects with token counts in metadata
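To make the token-limit behavior concrete, here is a minimal sketch of the splitting step only. It is not the library's implementation: it assumes a whitespace tokenizer in place of a real model tokenizer, skips tree-sitter parsing entirely, and splits on line boundaries. The names `count_tokens` and `split_to_limit` are hypothetical.

```python
def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


def split_to_limit(text: str, max_tokens: int) -> list[str]:
    """Greedily pack whole lines into pieces of at most max_tokens tokens.

    A single line that is itself over the limit is emitted on its own,
    so this sketch does not enforce the limit for such lines.
    """
    pieces: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for line in text.splitlines():
        n = count_tokens(line)
        # Flush the current piece before it would exceed the budget.
        if current and current_tokens + n > max_tokens:
            pieces.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += n
    if current:
        pieces.append("\n".join(current))
    return pieces


source = "def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b"
for piece in split_to_limit(source, max_tokens=8):
    print(count_tokens(piece), repr(piece))
```

Splitting on line boundaries is the crudest way to keep pieces readable. According to the description above, the real function tries to preserve code structure when it splits, which a line-based splitter cannot do.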
Classes
- class chunker.CodeChunk(language, file_path, node_type, start_line, end_line, byte_start, byte_end, parent_context, content, chunk_id='', parent_chunk_id=None, references=<factory>, dependencies=<factory>, metadata=<factory>, node_id='', file_id='', symbol_id=None, parent_route=<factory>, qualified_route=<factory>, definition_id='')
Bases: object
- __init__(language, file_path, node_type, start_line, end_line, byte_start, byte_end, parent_context, content, chunk_id='', parent_chunk_id=None, references=<factory>, dependencies=<factory>, metadata=<factory>, node_id='', file_id='', symbol_id=None, parent_route=<factory>, qualified_route=<factory>, definition_id='')
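The `<factory>` defaults in the signature are how Python renders dataclass fields declared with `default_factory`, which suggests CodeChunk is a dataclass whose container fields get a fresh object per instance. The sketch below illustrates that behavior with a hypothetical, reduced stand-in; `MiniChunk`, its field subset, and the `list`/`dict` container types are assumptions, not the real class definition.

```python
from dataclasses import dataclass, field


@dataclass
class MiniChunk:
    """Hypothetical, reduced stand-in for CodeChunk (not the real class)."""

    language: str
    node_type: str
    start_line: int
    end_line: int
    content: str
    chunk_id: str = ""
    # Assumed container types; the documented signature only shows <factory>.
    references: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


a = MiniChunk("python", "function_definition", 1, 3, "def f(): ...")
b = MiniChunk("python", "function_definition", 5, 7, "def g(): ...")

# Each instance owns its own metadata dict, so mutating one leaves the other empty.
a.metadata["tokens"] = 4
print(a.metadata)
print(b.metadata)
```

The practical consequence is that it is safe to mutate `metadata` or `references` on one chunk without affecting others, which would not hold for a shared mutable default.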
- class chunker.ChunkerConfig(config_path=None, use_env_vars=True)
Bases: object
Configuration manager for the chunker system.
Supports environment variable expansion and overrides:
- ${VAR} or ${VAR:default} syntax in config files
- CHUNKER_* environment variables override config values
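The `${VAR}` and `${VAR:default}` expansion could work roughly as sketched below. This is an illustration under assumptions rather than ChunkerConfig's actual code: it reuses the regular expression documented as `ENV_VAR_PATTERN`, but the helper name `expand_env_vars` is hypothetical, and leaving an unset variable without a default untouched is a guess about the behavior.

```python
import os
import re

# Same pattern as the documented ChunkerConfig.ENV_VAR_PATTERN attribute.
ENV_VAR_PATTERN = re.compile(r"\$\{([^}]+)\}")


def expand_env_vars(value: str, env=None) -> str:
    """Replace ${VAR} and ${VAR:default} using env (hypothetical helper)."""
    env = os.environ if env is None else env

    def substitute(match: re.Match) -> str:
        # Split "VAR:default" at the first colon; sep is "" when no default is given.
        name, sep, default = match.group(1).partition(":")
        if name in env:
            return env[name]
        # Assumption: an unset variable without a default is left as written.
        return default if sep else match.group(0)

    return ENV_VAR_PATTERN.sub(substitute, value)


env = {"CACHE_DIR": "/tmp/cache"}
print(expand_env_vars("${CACHE_DIR}/chunks", env))  # /tmp/cache/chunks
print(expand_env_vars("${LOG_LEVEL:INFO}", env))    # INFO
print(expand_env_vars("${MISSING}", env))           # ${MISSING}
```

Passing `env` explicitly keeps the sketch easy to test; by default it falls back to `os.environ`.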
- DEFAULT_CONFIG_FILENAME = 'chunker.config'
- ENV_PREFIX = 'CHUNKER_'
- ENV_VAR_PATTERN = re.compile('\\$\\{([^}]+)\\}')
- classmethod find_config(start_path=None)
Find configuration file starting from the given path.
- save(config_path=None)
Save configuration to file.
Note: For TOML output, requires the optional ‘tomli-w’ package. Install with: pip install tomli-w
- Return type:
- get_plugin_config(language)
Get configuration for a specific language plugin.
- Return type:
PluginConfig
- set_plugin_config(language, config)
Set configuration for a specific language plugin.
- Return type:
Examples
Basic usage:
from chunker import chunk_file
# Chunk a Python file
chunks = chunk_file("example.py", language="python")
for chunk in chunks:
    print(f"{chunk.node_type}: {chunk.start_line}-{chunk.end_line}")
Streaming:
from chunker import chunk_file_streaming
for chunk in chunk_file_streaming("large_file.py", language="python"):
    print(f"{chunk.node_type}: {chunk.content[:50]}...")
Parallel processing:
from chunker import chunk_directory
results = chunk_directory("src/", language="python", workers=4)
for result in results:
    print(f"{result.file_path}: {len(result.chunks)} chunks")