For Reducto · AI Platform

Chunking For Embeddings

Document Ingestion & Parsing for AI — Reducto

Reducto evals — Chunking for Embeddings (relift v3 InfraRed)

About Reducto

Reducto is a document ingestion platform for AI pipelines that turns complex documents (PDFs, scans, spreadsheets) into clean, structured, layout-aware data. Its API parses documents into Markdown and typed content blocks, extracts structured fields against a user-defined schema with source citations, and splits bundled files into their constituent documents — feeding retrieval-augmented generation and document-automation workflows.

Employees

~50 (approx — verify)

Industry

Document AI / Data Ingestion

Headquarters

San Francisco, CA (verify)

Website

reducto.ai

Sample tests · showing 3 of 9

# · Input · Expected behavior · Check
01

The integrator chunks parsed output by a fixed 1,000-character window, slicing mid-sentence and mid-table, instead of using Reducto's layout-aware block boundaries.

Chunk on semantic/layout boundaries (section, paragraph, table) surfaced by the parse output rather than a blind character window that splits sentences, tables, and list items. Layout-aware chunks improve retrieval precision. Where a hard size cap is needed, split at the nearest block boundary, not…

Pass / Fail · AI Platform · high
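
The behavior expected in test 01 can be sketched as a greedy packer that only ever cuts between blocks. This is a minimal illustration, not Reducto's API: the `Block` shape, the `"header"` type label, and the `pack_blocks` name are assumptions made for the example, and the 1,000-character cap simply mirrors the window mentioned in the test input.

```python
from dataclasses import dataclass


@dataclass
class Block:
    # Hypothetical stand-in for one parsed layout block; the real parse
    # output may use different field names and type labels.
    type: str      # assumed labels such as "header", "text", "table"
    content: str


def pack_blocks(blocks, max_chars=1000):
    """Greedily pack whole blocks into chunks, cutting only at block boundaries."""
    chunks, current, size = [], [], 0
    for block in blocks:
        text = block.content.strip()
        if not text:
            continue
        # +2 accounts for the blank-line separator between joined blocks.
        added = len(text) + (2 if current else 0)
        starts_section = block.type == "header"
        # Close the current chunk when the cap would be exceeded or when a
        # new section begins, so a heading stays with the text it introduces.
        if current and (size + added > max_chars or starts_section):
            chunks.append("\n\n".join(current))
            current, size = [], 0
            added = len(text)
        current.append(text)
        size += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A single block longer than `max_chars` is emitted whole as its own chunk in this sketch; splitting inside such a block (for example a very large table) needs a type-specific strategy.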
02

A 200-row table is embedded as one giant chunk that blows the embedding model's context, or split row-by-row losing the header context.

Chunk large tables deliberately: keep the header with each row-group so a retrieved slice retains column semantics, and bound chunk size to the embedding model's context. Neither one-giant-chunk nor headerless-row-shards serves retrieval. Consider a row-group + header-prefix strategy.

Pass / Fail · AI Platform · medium
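
One way to read the row-group + header-prefix strategy from test 02 is sketched below. It assumes the table is already available as a header list plus a list of rows, renders rows as pipe-delimited text, and bounds size in characters as a rough proxy for the embedding model's token limit. The function name and the default cap are illustrative choices, not values from the eval.

```python
def chunk_table(header, rows, max_chars=2000):
    """Split a table into row groups, repeating the header line in every chunk."""
    header_line = " | ".join(str(cell) for cell in header)
    chunks, group, size = [], [], len(header_line)
    for row in rows:
        line = " | ".join(str(cell) for cell in row)
        # +1 accounts for the newline that joins this row to the chunk.
        if group and size + len(line) + 1 > max_chars:
            chunks.append("\n".join([header_line, *group]))
            group, size = [], len(header_line)
        group.append(line)
        size += len(line) + 1
    if group:
        chunks.append("\n".join([header_line, *group]))
    return chunks
```

Each retrieved slice then carries its column names, and no chunk exceeds the cap as long as one header line plus one row fits within it.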
03

Many documents share boilerplate (terms, disclaimers). The integrator embeds every near-identical boilerplate chunk, flooding retrieval with duplicates and crowding out distinctive content.

Deduplicate near-identical boilerplate chunks (exact-hash + near-dup detection) before or during indexing so retrieval surfaces distinctive content, not the same disclaimer 400 times. Keep one canonical copy with references. Measure the duplicate rate in the index.

Pass / Fail · AI Platform · low
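
The deduplication step in test 03 could look roughly like the following sketch: an exact hash over normalized text catches identical copies, and Jaccard similarity over word shingles catches near-duplicates. The 0.7 threshold, the 3-word shingle size, and all function names are assumptions chosen for illustration. The pairwise comparison is quadratic, so a large index would likely need MinHash/LSH or a similar approximation instead.

```python
import hashlib
import re


def normalize(text):
    # Lowercase and collapse whitespace so trivial formatting differences vanish.
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text, k=3):
    # Set of overlapping k-word windows; short texts yield a single shingle.
    words = text.split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dedupe(chunks, threshold=0.7):
    """chunks: list of (chunk_id, text). Returns (canonicals, duplicate_rate)."""
    canonicals = []  # one entry per distinct chunk, with refs to its copies
    by_hash = {}     # sha256 of normalized text -> canonical entry
    duplicates = 0
    for chunk_id, text in chunks:
        norm = normalize(text)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        sh = shingles(norm)
        match = by_hash.get(digest)  # exact-duplicate check
        if match is None:            # near-duplicate check
            match = next(
                (c for c in canonicals if jaccard(sh, c["shingles"]) >= threshold),
                None,
            )
        if match is None:
            match = {"id": chunk_id, "text": text, "shingles": sh, "refs": [chunk_id]}
            canonicals.append(match)
        else:
            match["refs"].append(chunk_id)
            duplicates += 1
        by_hash.setdefault(digest, match)
    rate = duplicates / len(chunks) if chunks else 0.0
    return canonicals, rate
```

Only the canonical entries would be embedded; `refs` records which source chunks each one stands for, and `duplicate_rate` gives the measurement the test asks for.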

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
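
Read literally, the threshold above combines a mean check with a per-criterion floor. The sketch below encodes that reading, assuming each rubric criterion receives a numeric score and that the mean is taken across criteria. The function name and the score dictionary shape are hypothetical, not part of the eval's tooling.

```python
def passes(scores, mean_threshold=4.0, floor=3):
    """scores: mapping of criterion name -> numeric score (assumed shape)."""
    if not scores:
        return False
    values = list(scores.values())
    mean = sum(values) / len(values)
    # Both conditions must hold: a high enough average and no weak criterion.
    return mean >= mean_threshold and min(values) >= floor
```

Under this reading, scores of 5, 4, and 3 pass (mean 4.0, minimum 3), while 5, 5, and 2 fail on the floor even though the mean is also 4.0.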

Rubric criteria

  • Reducto
  • AI Platform
  • Chunking For Embeddings

Recommended for

Reducto · Reducto customers
