DataPrep-Bench: Benchmarking LLMs as Training Data Preparators

01 Abstract

The quality of training data fundamentally determines the capabilities of large language models (LLMs). As the community increasingly relies on LLMs, agents, and data-centric workflows to produce and curate training corpora, a foundational question emerges: how well can these systems actually prepare training data end to end? Despite the rapid proliferation of LLM-driven data preparation techniques, no unified benchmark exists to systematically measure their effectiveness.

We view LLM-driven data preparation as comprising two complementary capabilities: data construction, which transforms raw sources into high-quality training data, and data quality evaluation, which predicts the training value of candidate datasets before downstream training. We introduce DataPrep-Bench, the first comprehensive benchmark that jointly evaluates both capabilities as first-class targets under a unified, downstream-grounded protocol.

DataPrep-Bench is organized around two tracks. (1) Data Construction takes low-quality or otherwise non-trainable sources (e.g., domain books) as input and asks agents or workflows to transform them into supervised training data; the produced data is evaluated end to end by the downstream performance of models fine-tuned on it. We release Data-Construction-Skill as a strong baseline. (2) Data Quality Evaluation takes mainstream candidate training datasets and produces scalar scores; its goal is to measure whether a scoring function is linearly predictive of downstream utility. We release the Distributional Alignment Score (DAS) as a strong baseline. Across multiple domains and architectures, DAS achieves strong overall correlation with downstream performance and outperforms existing quality-, diversity-, and heuristic-based evaluators in most settings.

02 Framework Overview

DataPrep-Bench evaluates LLM-driven data preparation: the use of LLMs, agents, and data workflows to produce or assess training data. It covers two capabilities: Data Construction, which turns raw domain sources into supervised training data, and Data Quality Evaluation, which predicts, before any training is run, which candidate datasets are likely to improve downstream models. Both tracks are tested under shared domains, base models, training protocols, and downstream benchmarks, so methods are compared by their actual downstream impact.

Domains

Math

Science

Medical

Finance

Law

General Text

Figure: The overall framework of DataPrep-Bench.

How DataPrep-Bench differs from prior benchmarks and data tooling

Prior Work	Focus	LLM-as-Preparator	Construction	Quality Eval.	Downstream-Grounded
DataComp	Image–text filtering	No	—	Filtering	Yes
DataComp-LM	Text corpus curation	No	—	Curation	Yes
DataPerf	Selection / debugging	No	—	Selection	Yes
DCBENCH	Tabular tasks	No	—	Slice/feature	Yes
Data-Juicer / DataFlow	Data processing tooling	No comparative benchmark	—	—	—
DataPrep-Bench (ours)	SFT data preparation	LLMs · Workflows · Agents · Skills	Yes	Yes	Yes (end-to-end)

Existing benchmarks mainly evaluate curation or selection over pre-existing corpora, while data processing systems provide modular tooling without benchmarking the comparative effectiveness of different preparators. DataPrep-Bench directly evaluates LLMs and agents as data preparators across construction and quality evaluation, grounded in end-to-end downstream performance.

03 Two Evaluation Tracks

Both tracks ask the same practical question: does a data preparation decision lead to a better downstream model? They differ in where the decision happens. One track evaluates methods that create training data from raw sources; the other evaluates metrics that score existing candidate datasets before training.

Track 1

Data Construction

Can a method turn raw domain materials into useful supervised fine-tuning data?

Input

Domain books, manuals, and long-form knowledge sources converted into a shared Markdown format.

Method Output

A supervised question-answer dataset synthesized from those raw sources.

Judgment

Fine-tune the same base model with the constructed data plus Dolly-15k, then evaluate on held-out domain benchmarks.

Released Baseline

Data-Construction-Skill, a skill-guided agent with reusable schemas, filtering rules, coverage checks, and validation utilities.

Track 2

Data Quality Evaluation

Can a metric predict which candidate datasets will improve downstream performance before fine-tuning?

Input

Public SFT candidate pools for each domain, mixing in-domain and out-of-domain datasets, with optional domain proxies.

Metric Output

One utility score for each candidate dataset, computed before any downstream training is run.

Judgment

Compare the metric scores against released ground-truth performance records from models fine-tuned on the same candidates.

Released Baseline

Distributional Alignment Score (DAS), a training-free metric that measures how closely a candidate matches a domain proxy.

04 Strong Baselines

For Track 1 · Data Construction

Data-Construction-Skill

A skill-guided agentic method for turning long-form domain documents into reusable QA-style SFT data. It is designed for corpus-scale construction, where a single prompt is not enough to manage decomposition, consistency, coverage, validation, and resumable execution.

Core idea

The agent still plans over the corpus and executes the construction work, but the reusable skill layer defines what counts as valid supervision. The skill packages task instructions, output schemas, sample-type definitions, filtering rules, coverage requirements, and validation utilities into a structured interface for the agent.

This makes the method more than a one-off QA-generation prompt: it is a controllable framework for extracting, reformulating, checking, and tracking supervision across hundreds of pages of expert-authored content.

What the skill controls

Which chunks contain reusable domain knowledge
Which sample types are valid for each chunk
Whether questions are faithful and self-contained
Which malformed, duplicated, or hallucinated samples are rejected
How coverage and resumability are recorded

Construction pipeline

1. Chunk long documents

Split books and manuals into semantically coherent chunks that are small enough for agent processing but still locally complete.

2. Triage reusable knowledge

Keep chunks that teach definitions, rules, mechanisms, conditions, exceptions, comparisons, or causal links; skip noisy or navigational content.

3. Generate three QA forms

Create concept QA, source-grounded reasoning QA, and simple case-application QA when the chunk supports them.

4. Validate and track coverage

Remove document-relative phrasing and weak samples, then maintain chunk-level records for coverage checking and resumable runs.

Why it matters: Data-Construction-Skill preserves the flexibility of agentic planning while using explicit skill-level constraints to stabilize quality, coverage, and faithfulness across long-form corpora.
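The validation and coverage-tracking step can be illustrated with a minimal Python sketch. It is not the released skill: the `ChunkRecord` fields, the `DOC_RELATIVE` patterns, and the `is_valid` / `process_chunk` helpers are hypothetical names chosen for illustration, and the sketch covers only malformed, document-relative, and duplicated samples (hallucination checks would need the source chunk and are left out).

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for document-relative phrasing; not the released rules.
DOC_RELATIVE = re.compile(
    r"\b(according to the (text|passage|chapter)|in this (section|chapter)|as shown above)\b",
    re.IGNORECASE,
)


@dataclass
class ChunkRecord:
    """Chunk-level bookkeeping used for coverage checks and resumable runs."""
    chunk_id: str
    status: str = "pending"  # "pending" or "done"
    kept: int = 0
    rejected: int = 0


def is_valid(sample: dict, seen_questions: set) -> bool:
    """Reject malformed, document-relative, or duplicated QA samples."""
    question = sample.get("question", "").strip()
    answer = sample.get("answer", "").strip()
    if not question or not answer:
        return False  # malformed: a field is missing or empty
    if DOC_RELATIVE.search(question):
        return False  # not self-contained: refers back to the source document
    key = question.lower()
    if key in seen_questions:
        return False  # duplicate question (case-insensitive)
    seen_questions.add(key)
    return True


def process_chunk(record: ChunkRecord, samples: list, seen_questions: set) -> list:
    """Validate one chunk's samples and update its coverage record."""
    if record.status != "pending":
        return []  # resumability: chunks already processed are skipped
    kept = [s for s in samples if is_valid(s, seen_questions)]
    record.kept = len(kept)
    record.rejected = len(samples) - len(kept)
    record.status = "done"
    return kept


samples = [
    {"question": "What is a fiduciary duty?",
     "answer": "An obligation to act in another party's best interest."},
    {"question": "According to the passage, what is a fiduciary duty?",
     "answer": "An obligation."},
    {"question": "what is a fiduciary duty?",
     "answer": "A duplicate of the first question."},
    {"question": "", "answer": "Malformed: missing question."},
]
record = ChunkRecord(chunk_id="book-001/chunk-0007")
seen = set()
kept = process_chunk(record, samples, seen)
rerun = process_chunk(record, samples, seen)  # second pass is a no-op
print(record, len(kept), len(rerun))
```

Keeping the record per chunk, rather than per run, is what lets an interrupted job restart without regenerating samples for chunks that are already marked done.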

For Track 2 · Data Quality Evaluation

Distributional Alignment Score (DAS)

A training-free metric for estimating whether a candidate SFT dataset is likely to help a target domain. DAS scores a candidate by measuring how closely its text distribution aligns with a domain proxy dataset.

Core idea

If a candidate dataset is distributionally close to a high-quality proxy for the target domain, it is more likely to provide useful training signal for that domain. DAS turns this intuition into a plug-in score by embedding both the candidate and the proxy with the same fixed text encoder, then measuring their distributional distance with MMD.

We use a proxy rather than the benchmark test set itself, so the metric can estimate target-domain proximity without leaking test data into dataset selection. Higher DAS means smaller proxy distance and stronger predicted downstream utility.
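For reference, one standard way to write the quantity behind DAS is the empirical squared MMD under a Gaussian RBF kernel. The benchmark description names MMD and the kernel family but not the estimator or the bandwidth, so the biased estimator and the bandwidth symbol \(\sigma\) below are assumptions. Here \(u_1, \dots, u_n\) are the encoder embeddings of the candidate samples and \(v_1, \dots, v_m\) those of the proxy samples.

\[
\widehat{\mathrm{MMD}}^2 = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{i'=1}^{n} k(u_i, u_{i'}) + \frac{1}{m^2}\sum_{j=1}^{m}\sum_{j'=1}^{m} k(v_j, v_{j'}) - \frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} k(u_i, v_j),
\qquad
k(u, v) = \exp\!\left(-\frac{\lVert u - v\rVert^2}{2\sigma^2}\right)
\]

\[
\mathrm{DAS} = -\,\mathrm{MMD}
\]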

What DAS is testing

Can a metric rank datasets before fine-tuning?
Does domain alignment predict downstream utility?
Can we avoid benchmark contamination by using proxies?
Do metric scores correlate with released ground-truth performance records?

Scoring pipeline

1. Encode

Represent samples from the candidate dataset and the domain proxy with the same fixed text encoder for comparability.

2. Measure alignment

Compute MMD between the candidate and proxy feature distributions using a Gaussian RBF kernel.

3. Score utility

Return a higher score for candidates with smaller proxy distance, then test whether those scores predict real downstream performance.
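The three steps can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the released implementation: the embeddings are random stand-ins for encoder outputs, and the biased estimator, the square root, and the bandwidth `sigma` are assumptions, since the description above only specifies MMD with a Gaussian RBF kernel.

```python
import numpy as np


def rbf_kernel(a: np.ndarray, b: np.ndarray, sigma: float) -> np.ndarray:
    """Gaussian RBF kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * sigma**2))


def mmd(candidate: np.ndarray, proxy: np.ndarray, sigma: float = 1.0) -> float:
    """Biased MMD estimate between two sets of embeddings (assumed estimator)."""
    k_cc = rbf_kernel(candidate, candidate, sigma).mean()
    k_pp = rbf_kernel(proxy, proxy, sigma).mean()
    k_cp = rbf_kernel(candidate, proxy, sigma).mean()
    return float(np.sqrt(max(k_cc + k_pp - 2.0 * k_cp, 0.0)))


def das(candidate: np.ndarray, proxy: np.ndarray, sigma: float = 1.0) -> float:
    """Higher score means smaller proxy distance (DAS = -MMD)."""
    return -mmd(candidate, proxy, sigma)


# Random stand-ins for encoder embeddings; a real run would embed text instead.
rng = np.random.default_rng(0)
proxy = rng.normal(0.0, 1.0, size=(200, 8))
near = rng.normal(0.1, 1.0, size=(200, 8))  # close to the proxy distribution
far = rng.normal(2.0, 1.0, size=(200, 8))   # shifted away from the proxy

das_near = das(near, proxy, sigma=4.0)
das_far = das(far, proxy, sigma=4.0)
print(das_near > das_far)
```

In practice the bandwidth matters: a kernel that is too narrow or too wide flattens the differences between candidates, so a data-dependent choice such as the median pairwise distance is a common default.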

Domain proxy datasets used by DAS

General: Infinity-Instruct
Math: ODA-Math-460k
Science: Logics-STEM
Medical: ReasonMed
Finance: Fin-o1
Law: DISC-Law-SFT

Why it matters: DAS is grounded in domain-adaptation intuition: under a fixed model family and training protocol, better alignment between training data and the target domain should reduce target-side risk. The proxy formulation makes this usable without touching benchmark test data.

05 Benchmarks & Experimental Setup

The experiments instantiate the two-track benchmark under fixed resources and protocols. In Track 1, every construction method consumes the same raw sources and is judged by the downstream performance of models trained on its synthesized data. In Track 2, every quality metric scores the same candidate datasets and is judged by how well those scores predict the released ground-truth downstream performance records.

A data preparation decision is evaluated only through downstream model behavior: either by training on the produced data, or by testing whether a metric predicts the performance of training on candidate data.

Track 1 · Data Construction

Same raw sources, downstream judgment

DataFlow workflows, direct LLM generation, ReAct-style agents, and Data-Construction-Skill all consume the same domain source corpus and output synthesized SFT data.

Sources: domain books and long-form materials converted to Markdown with MinerU; General Text uses a 150 MB FineWeb sample.
Training: each synthesized dataset is mixed with Dolly-15k; the reference baseline uses Dolly-15k alone.
Models: Qwen2.5-7B and Llama-3.1-8B are fine-tuned with the same LlamaFactory recipe.

Track 2 · Data Quality Evaluation

Same candidate pools, predictive judgment

DAS and 17 DataFlow quality/diversity evaluators score the same public SFT candidate pools before downstream training.

Candidates: each domain mixes in-domain and out-of-domain public SFT datasets.
Ground truth: every candidate is actually fine-tuned and evaluated to obtain downstream utility.
Models: Qwen2.5-7B, Llama-3.1-8B, and Mistral-7B-v0.3 test whether metric correlations are model-dependent.

Shared domains and target benchmarks

General: MMLU-Redux
Math: AIME24, AMC23, Gaokao2024, GSM8K, MATH, MinervaMath, OlympiadBench
Science: MMLU-STEM, MMLU-Pro, GPQA, SuperGPQA, ChemBench, PIQA, SciBench
Medical: MedR-Bench, MedMCQA, MedCaseReasoning
Finance: XFinBench, FinEval-KR, CPA-KQA
Law: LegalBench, LexGLUE

06 Track 1 · Data Construction Results

Track 1 asks whether synthesized domain data actually helps after fine-tuning. We compare DataFlow workflows, direct LLM generation, ReAct-style agents, and our skill-guided agent baseline. Each method receives the same raw source corpus; its output is mixed with Dolly-15k, used to fine-tune Qwen2.5-7B or Llama-3.1-8B, and evaluated on held-out benchmarks across six domains.

What Qwen2.5-7B shows

No method family dominates. On Qwen2.5-7B, agents achieve the strongest averages in Math and Medical, DataFlow is strongest in Law, and different methods win different sub-benchmarks.
Skill is competitive but not universal. It ties for best on General Text (78.2), is best on Minerva-Math (14.0), and stays near the top on Math Avg (24.1).
Synthetic data is not automatically beneficial. The no-synthetic Dolly-only baseline remains strong in several domains, especially Science (27.9 Avg).
Finance and Medical reward different construction behavior. DataFlow-Skill has the strongest Finance average (64.8), while the Gemini 3.0 Pro agent has the strongest Medical average (43.8).

Table A. Qwen2.5-7B — Math, General, and Finance

Generator	Math								General		Finance
Generator	GSM8K	AMC23	AIME24	M-Math	OB	M500	GK24	Avg	MMLU-R	Avg	CPA-KQA	FinEval-KR	XFinBench	Avg
No synthetic training	69.9	17.5	0.0	10.7	10.7	39.8	16.5	23.6	77.7	77.7	57.6	59.4	56.3	57.8
DataFlow-based Generators
DataFlow	56.5	7.5	0.0	7.7	7.9	27.4	17.6	17.8	77.9	77.9	51.0	54.5	59.3	54.9
DataFlow-Skill	56.7	10.0	0.0	8.8	6.2	22.8	28.6	19.0	76.5	76.5	60.0	65.4	68.9	64.8
LLM-based Generators
Claude Opus 4.6	68.8	12.5	0.0	8.8	9.8	35.5	14.3	21.4	78.2	78.2	37.6	39.6	55.9	44.4
Gemini 3.0 Pro	72.1	20.0	0.0	10.7	11.4	37.9	15.4	23.9	77.8	77.8	48.6	49.5	58.4	52.2
GPT-5.2	66.7	17.5	0.0	7.7	11.1	32.7	14.3	21.4	77.9	77.9	47.1	43.6	53.6	48.1
Agent-based Generators
Qwen3.5-Plus	72.7	22.5	3.3	11.0	11.6	38.7	16.5	25.2	77.6	77.6	48.1	49.5	55.6	51.1
GLM-4.7	71.4	22.5	3.3	9.6	11.6	37.9	20.9	25.3	77.3	77.3	51.4	55.4	53.1	53.3
Claude Opus 4.6	69.4	10.0	0.0	11.0	8.9	33.8	19.8	21.8	75.9	75.9	32.4	39.6	53.8	41.9
Gemini 3.0 Pro	70.1	15.0	0.0	8.8	10.8	35.7	20.9	23.0	77.7	77.7	53.3	52.5	55.4	53.7
GPT-5.2	69.6	25.0	3.3	8.8	9.9	35.6	26.4	25.5	77.7	77.7	39.1	39.6	51.0	43.2
GPT-5.3-codex	70.8	15.0	0.0	11.0	11.7	38.5	13.2	22.9	77.6	77.6	58.6	63.4	55.9	59.3
Skill (Claude Opus 4.6)	72.6	17.5	3.3	14.0	11.1	36.8	13.2	24.1	78.2	78.2	57.6	53.5	55.4	55.5

Abbreviations: M-Math = Minerva-Math · OB = OlympiadBench · M500 = MATH-500 · GK24 = Gaokao 2024 · MMLU-R = MMLU-Redux. Source: Table synthetic_data_mgf_qwen in the paper. Bold marks the best in each column.
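The Avg columns are consistent with an unweighted mean over the benchmarks in each domain, rounded to one decimal. That reading is inferred from the reported numbers rather than stated in the table, so treat it as an assumption. A small check against the "No synthetic training" row of Table A:

```python
# Scores copied from the "No synthetic training" row of Table A (Qwen2.5-7B).
math_scores = {
    "GSM8K": 69.9, "AMC23": 17.5, "AIME24": 0.0, "M-Math": 10.7,
    "OB": 10.7, "M500": 39.8, "GK24": 16.5,
}
finance_scores = {"CPA-KQA": 57.6, "FinEval-KR": 59.4, "XFinBench": 56.3}


def domain_avg(scores: dict) -> float:
    """Unweighted mean over a domain's benchmarks, rounded to one decimal."""
    return round(sum(scores.values()) / len(scores), 1)


math_avg = domain_avg(math_scores)
finance_avg = domain_avg(finance_scores)
print(math_avg, finance_avg)  # matches the reported 23.6 and 57.8
```

Because the mean is unweighted, a small benchmark such as AIME24 moves the Math average as much as GSM8K does, which is worth keeping in mind when comparing generators whose averages differ by a point or less.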

Table B. Qwen2.5-7B — Law, Medical, and Science

Generator	Law			Medical				Science
Generator	LegalBench	LexGLUE	Avg	MedCaseR.	MedMCQA	MedR-Bench	Avg	MMLU-STEM	MMLU-Pro	GPQA	SuperGPQA	ChemBench	PIQA	SciBench	Avg
No synthetic training	86.9	62.0	74.5	13.6	27.4	67.8	36.3	47.5	28.6	21.1	15.7	24.3	52.7	5.2	27.9
DataFlow-based Generators
DataFlow	89.7	64.8	77.2	9.9	29.4	63.6	34.3	37.9	23.9	20.7	11.8	15.7	28.8	2.0	20.1
DataFlow-Skill	92.0	57.4	74.7	11.9	24.1	65.6	33.9	44.2	24.6	18.0	13.0	18.8	43.5	2.0	23.4
LLM-based Generators
Claude Opus 4.6	88.0	63.2	75.6	13.9	8.1	66.8	29.6	38.7	24.6	22.4	13.1	19.3	40.9	2.7	23.1
Gemini 3.0 Pro	85.9	63.5	74.7	12.2	6.0	69.1	29.1	39.7	25.7	19.7	12.7	21.2	44.0	3.3	23.8
GPT-5.2	89.2	60.9	75.0	10.6	23.3	66.0	33.3	38.3	24.6	22.9	12.8	17.9	38.1	2.0	22.4
Agent-based Generators
Qwen3.5-Plus	90.2	61.0	75.6	16.5	16.5	68.3	33.8	47.5	26.9	21.7	14.8	22.9	48.6	5.3	26.8
GLM-4.7	84.8	61.4	73.1	15.2	10.6	66.8	30.9	42.0	25.7	18.9	12.7	19.4	43.6	2.3	23.5
Claude Opus 4.6	65.2	48.8	57.0	16.6	50.7	54.3	40.5	36.6	24.1	19.0	12.9	19.2	44.1	2.2	22.6
Gemini 3.0 Pro	57.5	55.5	56.5	15.8	52.3	63.4	43.8	46.0	27.6	22.4	14.7	20.3	50.0	2.5	26.2
GPT-5.2	61.7	51.9	56.8	14.7	45.0	59.0	39.6	44.0	25.3	21.1	14.0	20.2	46.4	2.8	24.8
GPT-5.3-codex	87.8	63.4	75.6	13.6	20.3	70.0	34.6	35.9	22.4	16.4	11.2	16.5	33.1	3.4	19.8
Skill (Claude Opus 4.6)	85.5	61.7	73.6	9.9	15.3	65.4	30.2	39.8	22.8	19.2	12.2	17.1	36.9	1.7	21.4

Abbreviations: MedCaseR. = MedCaseReasoning. Source: Table synthetic_data_slm_qwen in the paper.

What Llama-3.1-8B shows

The clearest positive signal is Finance. DataFlow-Skill reaches 36.5 and Data-Construction-Skill reaches 34.2, both substantially above the Dolly-only baseline of 15.1.
Medical also benefits from constructed data. Claude Opus direct generation reaches 36.1 and Skill is close at 35.4, against a Dolly-only baseline of 20.4.
Science is a counterexample to simple scaling. The Dolly-only baseline is best at 13.2, and every generated dataset lowers the Science average.
Math and General Text are also fragile. Every generator falls below the no-synthetic baseline on both averages (11.7 on Math, 67.1 on General Text), showing why the benchmark must measure downstream utility rather than dataset appearance.
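The Science regression can be made concrete by subtracting the Dolly-only baseline from each generator's Science average. The values below are copied from the Science Avg column of Table D; the `LLM:` and `Agent:` key prefixes are added here only to disambiguate models that appear in both families.

```python
# Science Avg on Llama-3.1-8B, copied from Table D.
baseline = 13.2  # "No synthetic training" (Dolly-15k only)
science_avg = {
    "DataFlow": 4.7,
    "DataFlow-Skill": 3.1,
    "LLM: Claude Opus 4.6": 9.3,
    "LLM: Gemini 3.0 Pro": 9.1,
    "LLM: GPT-5.2": 8.5,
    "Agent: Qwen3.5-Plus": 9.3,
    "Agent: GLM-4.7": 8.9,
    "Agent: Claude Opus 4.6": 9.1,
    "Agent: Gemini 3.0 Pro": 5.5,
    "Agent: GPT-5.2": 4.3,
    "Agent: GPT-5.3-codex": 8.3,
    "Skill (Claude Opus 4.6)": 7.5,
}

# Change relative to the baseline, in points; negative means a regression.
deltas = {name: round(avg - baseline, 1) for name, avg in science_avg.items()}
regressions = [name for name, delta in deltas.items() if delta < 0]

for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{name:26s} {delta:+.1f}")
print(f"{len(regressions)} of {len(deltas)} generators regress")
```

The same subtraction applied to the Finance Avg column of Table C gives positive deltas for every generator, which is the contrast the bullets above describe.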

Table C. Llama-3.1-8B — Math, General, and Finance

Generator	Math								General		Finance
Generator	GSM8K	AMC23	AIME24	M-Math	OB	M500	GK24	Avg	MMLU-R	Avg	CPA-KQA	FinEval-KR	XFinBench	Avg
No synthetic training	33.9	10.0	0.0	6.6	3.7	13.6	14.3	11.7	67.1	67.1	12.6	12.4	20.3	15.1
DataFlow-based Generators
DataFlow	8.4	10.0	0.0	4.8	2.5	6.1	8.8	5.8	51.1	51.1	27.1	26.7	40.5	31.4
DataFlow-Skill	8.7	2.5	0.0	4.4	3.3	7.2	16.5	6.1	48.4	48.4	34.8	31.7	43.0	36.5
LLM-based Generators
Claude Opus 4.6	28.7	5.0	0.0	6.2	2.8	11.2	11.0	9.3	52.9	52.9	22.9	24.8	37.5	28.4
Gemini 3.0 Pro	30.7	7.5	0.0	8.1	3.7	12.6	14.3	11.0	50.8	50.8	23.8	27.7	42.1	31.2
GPT-5.2	26.8	2.5	0.0	7.7	4.1	10.2	13.2	9.2	47.4	47.4	25.2	24.8	32.2	27.4
Agent-based Generators
Qwen3.5-Plus	33.7	5.0	0.0	6.6	4.4	11.3	12.1	10.4	50.4	50.4	23.3	20.8	38.2	27.4
GLM-4.7	31.8	2.5	0.0	5.9	3.1	13.0	17.6	10.6	49.2	49.2	23.8	25.7	43.0	30.8
Claude Opus 4.6	16.4	5.0	0.0	6.6	2.5	7.7	8.8	6.7	47.5	47.5	23.3	25.7	37.7	28.9
Gemini 3.0 Pro	29.6	7.5	0.0	6.2	3.6	11.5	17.6	10.9	47.9	47.9	25.2	21.8	40.9	29.3
GPT-5.2	28.7	5.0	0.0	6.6	4.0	10.4	15.4	10.0	49.0	49.0	21.9	21.8	44.8	29.5
GPT-5.3-codex	29.9	2.5	0.0	5.9	2.8	12.2	14.3	9.7	51.3	51.3	23.3	25.7	41.8	30.3
Skill (Claude Opus 4.6)	16.5	7.5	0.0	4.4	3.0	9.5	11.0	7.4	51.3	51.3	36.2	31.7	34.7	34.2

Abbreviations: M-Math = Minerva-Math · OB = OlympiadBench · M500 = MATH-500 · GK24 = Gaokao 2024 · MMLU-R = MMLU-Redux. Source: Table synthetic_data_mgf_llama in the paper. Bold marks the best in each column.

Table D. Llama-3.1-8B — Law, Medical, and Science

Generator	Law			Medical				Science
Generator	LegalBench	LexGLUE	Avg	MedCaseR.	MedMCQA	MedR-Bench	Avg	MMLU-STEM	MMLU-Pro	GPQA	SuperGPQA	ChemBench	PIQA	SciBench	Avg
No synthetic training	87.9	51.1	69.5	9.9	17.1	34.3	20.4	25.3	14.7	14.1	8.2	14.7	11.4	3.8	13.2
DataFlow-based Generators
DataFlow	84.0	61.3	72.7	12.3	21.3	56.1	29.9	8.5	3.7	5.1	2.5	6.1	6.4	0.3	4.7
DataFlow-Skill	85.2	61.0	73.1	12.4	33.8	61.8	36.0	5.6	1.7	3.3	1.7	4.6	4.8	0.0	3.1
LLM-based Generators
Claude Opus 4.6	85.9	56.1	71.0	16.3	27.2	64.8	36.1	17.1	10.4	10.8	6.1	9.2	9.6	2.2	9.3
Gemini 3.0 Pro	86.1	61.9	74.0	16.4	16.3	67.6	33.4	16.7	10.3	9.8	6.1	10.2	9.2	1.6	9.1
GPT-5.2	91.6	61.3	76.4	13.5	26.9	56.8	32.4	14.7	7.4	10.7	4.9	9.0	11.9	0.6	8.5
Agent-based Generators
Qwen3.5-Plus	84.2	62.3	73.2	16.7	13.7	69.9	33.4	17.3	9.5	11.1	5.1	12.1	8.9	0.9	9.3
GLM-4.7	83.8	60.7	72.2	15.6	20.0	67.1	34.2	16.9	9.9	9.1	5.4	12.0	8.4	0.6	8.9
Claude Opus 4.6	87.4	52.1	69.8	14.8	18.4	64.8	32.7	17.0	9.4	10.7	5.7	10.9	7.8	1.9	9.1
Gemini 3.0 Pro	86.3	60.6	73.5	16.5	15.9	70.2	34.2	10.1	3.8	5.0	2.3	5.9	11.0	0.3	5.5
GPT-5.2	84.5	61.3	72.9	16.8	21.3	60.1	32.7	8.1	3.3	2.9	2.2	4.4	9.1	0.2	4.3
GPT-5.3-codex	89.6	63.1	76.3	18.3	23.7	65.0	35.7	16.4	9.0	10.4	5.3	8.5	7.7	0.6	8.3
Skill (Claude Opus 4.6)	86.6	53.7	70.2	13.7	27.9	64.6	35.4	16.2	8.5	9.1	4.8	7.5	5.9	0.2	7.5

Abbreviations: MedCaseR. = MedCaseReasoning. Source: Table synthetic_data_slm_llama in the paper.

07 Track 2 · Data Quality Evaluation Results

Track 2 asks whether a data-quality metric can rank candidate datasets before paying the cost of fine-tuning. For each candidate, we compute a metric score, then compare those scores with the empirical downstream performance of models actually fine-tuned on the same candidates. A reliable metric should produce correlations with the expected sign, across domains and across base models.

How to read the correlation tables

DAS is reported through its MMD distance. Smaller MMD means better alignment, so a strong negative MMD-performance correlation corresponds to a strong positive DAS-performance correlation.
Underline marks reliability. Underlined cells are statistically significant and point in the theoretically expected direction.
Avg columns matter most. They summarize whether a metric remains useful across Qwen2.5-7B, Llama-3.1-8B, and Mistral-7B-v0.3.
The main result: Among the tested metrics, DAS has the strongest or tied strongest domain-average correlation in General, Math, Science, and Medical, while Finance and Law favor different non-DAS metrics.
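The sign convention is easy to verify numerically. The sketch below computes a Pearson correlation by hand and shows that negating the metric (MMD to DAS) flips the sign of the correlation without changing its magnitude. The five distance and accuracy values are made-up illustrative numbers, not benchmark results.

```python
import numpy as np


def pearson(x, y) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


# Illustrative values only: one MMD distance and one downstream score
# per candidate dataset.
mmd_distance = [0.12, 0.30, 0.45, 0.20, 0.60]
downstream = [41.0, 35.5, 30.2, 38.9, 27.4]

r_mmd = pearson(mmd_distance, downstream)
r_das = pearson([-d for d in mmd_distance], downstream)  # DAS = -MMD
print(f"MMD vs performance: {r_mmd:+.2f}")
print(f"DAS vs performance: {r_das:+.2f}")
```

Pearson correlation measures linear association, so it rewards metrics whose scores scale with downstream performance, not only metrics that rank candidates correctly.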

Legend: negative correlation · positive correlation · underline = significant and theoretically consistent (p < 0.05)

Table E. General · Math · Science — full Pearson correlations & runtime

Metric	Time (s)	General				Math				Science
Metric	Time (s)	Qwen	Llama	Mistral	Avg	Qwen	Llama	Mistral	Avg	Qwen	Llama	Mistral	Avg
Distribution-based
MMD (DAS = −MMD)	306.02	−0.64	−0.65	−0.74	−0.68	−0.72	−0.93	−0.94	−0.86	−0.53	−0.82	−0.80	−0.72
Quality-based
Qurating-WritingStyle	84.65	0.17	0.36	0.38	0.30	−0.48	−0.33	−0.46	−0.42	−0.10	−0.16	0.04	−0.07
Qurating-Expertise		−0.17	−0.37	−0.42	−0.32	0.61	0.76	0.71	0.69	0.56	0.76	0.83	0.72
Qurating-FactsTrivia		0.03	0.10	0.02	0.05	−0.35	−0.04	−0.20	−0.20	−0.08	0.14	0.32	0.13
Qurating-Educational		−0.03	0.17	0.13	0.09	−0.23	0.04	−0.10	−0.10	0.01	0.17	0.34	0.18
BERTVendi	459.37	0.38	0.47	0.58	0.47	−0.69	−0.40	−0.54	−0.54	−0.41	−0.10	−0.07	−0.19
SimCSEVendi	459.37	0.30	0.46	0.55	0.44	−0.80	−0.50	−0.63	−0.65	−0.66	−0.43	−0.34	−0.48
Deita-Quality	434.99	0.06	0.46	0.18	0.23	−0.42	−0.46	−0.54	−0.47	0.08	−0.26	−0.13	−0.10
RewardModel	109.61	0.56	0.67	0.54	0.59	−0.71	−0.81	−0.85	−0.79	−0.39	−0.54	−0.32	−0.42
Superfiltering	40.05	0.13	0.25	0.30	0.23	−0.11	−0.21	−0.29	−0.21	0.02	−0.10	−0.13	−0.07
FineWebEdu	11.67	−0.43	−0.43	−0.37	−0.41	0.42	0.74	0.65	0.60	0.09	0.45	0.59	0.37
PairQual	12.34	−0.03	−0.03	−0.15	−0.07	−0.16	0.06	−0.06	−0.05	0.19	0.23	0.40	0.27
Perplexity	40.60	0.47	0.43	0.48	0.46	−0.71	−0.56	−0.62	−0.63	−0.59	−0.32	−0.44	−0.45
Diversity-based
MTLD	2.59	0.48	0.38	0.32	0.39	−0.58	−0.59	−0.67	−0.62	−0.48	−0.67	−0.49	−0.55
HD-D	2.59	0.17	−0.08	0.02	0.03	−0.34	−0.17	−0.25	−0.25	−0.64	−0.61	−0.44	−0.56
Task2Vec	7.93	−0.04	0.16	0.17	0.10	−0.48	−0.09	−0.25	−0.27	−0.65	−0.36	−0.21	−0.41
Ngram	1.06	−0.01	−0.05	0.00	−0.02	−0.31	0.12	−0.05	−0.08	−0.22	0.04	0.22	0.01
Deita-Complexity	344.34	0.52	0.25	0.26	0.34	−0.19	−0.34	−0.32	−0.28	0.05	0.27	0.32	0.22

Source: Table eval_results_gms in the paper. The MMD row is the distance form of DAS, so negative correlations indicate that higher DAS predicts better downstream performance. MMD/DAS is strongest in General (\(-0.68\)) and Math (\(-0.86\)); in Science, its average (\(-0.72\)) ties QuRating-Expertise (\(+0.72\)) in magnitude, with strong correlations on Llama and Mistral (\(-0.82\), \(-0.80\)) and a weaker one on Qwen (\(-0.53\)).

Table F. Medical · Finance · Law — full Pearson correlations

Metric	Medical				Finance				Law
Metric	Qwen	Llama	Mistral	Avg	Qwen	Llama	Mistral	Avg	Qwen	Llama	Mistral	Avg
Distribution-based
MMD (DAS = −MMD)	−0.73	−0.87	−0.72	−0.77	0.18	−0.57	−0.14	−0.18	0.50	0.17	0.40	0.36
Quality-based
Qurating-WritingStyle	−0.10	0.20	−0.06	0.01	0.03	0.13	0.33	0.16	0.63	−0.08	0.64	0.40
Qurating-Expertise	0.56	0.73	0.66	0.65	−0.04	0.13	0.60	0.23	0.72	−0.37	0.37	0.24
Qurating-FactsTrivia	0.31	0.66	0.36	0.44	−0.06	0.23	0.47	0.21	0.58	−0.15	0.70	0.37
Qurating-Educational	0.05	0.40	0.14	0.20	−0.05	0.10	0.49	0.18	0.57	−0.27	0.64	0.31
Deita-Quality	0.57	0.63	0.68	0.63	−0.18	−0.45	0.15	−0.16	0.46	−0.59	0.24	0.04
RewardModel	0.26	0.32	0.23	0.27	−0.07	−0.30	−0.06	−0.14	0.43	−0.08	0.51	0.28
Superfiltering	−0.29	−0.22	−0.43	−0.31	0.71	0.60	0.69	0.67	0.52	0.52	0.41	0.48
FineWebEdu	0.18	0.55	0.29	0.34	−0.23	0.16	0.48	0.13	0.57	−0.22	0.59	0.31
PairQual	0.44	0.71	0.53	0.56	−0.10	0.09	0.39	0.13	0.63	−0.28	0.61	0.32
Perplexity	−0.53	−0.30	−0.64	−0.49	0.56	0.59	0.35	0.50	0.06	0.33	0.53	0.31
BERTVendi	−0.64	−0.26	−0.56	−0.49	0.33	0.79	0.23	0.45	0.42	0.20	0.44	0.35
SimCSEVendi	−0.47	−0.09	−0.50	−0.35	0.13	0.54	0.16	0.28	0.28	0.15	0.74	0.39
Diversity-based
Task2Vec	0.06	0.39	−0.03	0.14	−0.26	0.20	0.29	0.08	0.12	−0.12	0.91	0.30
MTLD	0.27	0.48	0.23	0.33	0.17	0.47	0.37	0.34	0.71	0.26	0.59	0.52
HD-D	0.20	0.55	0.16	0.30	0.15	0.48	0.40	0.34	0.45	0.03	0.84	0.44
Ngram	0.12	0.54	0.18	0.28	−0.02	0.50	0.42	0.30	0.54	−0.01	0.66	0.40
Deita-Complexity	−0.10	−0.02	−0.09	−0.07	0.57	0.06	0.44	0.36	0.55	0.01	0.25	0.27

Source: Table eval_results_mfl in the paper. MMD/DAS is strongest in Medical (\(-0.77\)). In Finance, Superfiltering is strongest (\(+0.67\)); in Law, MTLD is strongest (\(+0.52\)). No single metric is reliable across every domain.
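The per-domain Avg values for the MMD row, and their DAS form, can be reproduced from the per-model correlations in Tables E and F. The sketch assumes the Avg column is a plain mean over the three base models, which matches the reported numbers.

```python
# MMD-performance correlations (Qwen, Llama, Mistral), copied from Tables E and F.
mmd_corr = {
    "General": (-0.64, -0.65, -0.74),
    "Math": (-0.72, -0.93, -0.94),
    "Science": (-0.53, -0.82, -0.80),
    "Medical": (-0.73, -0.87, -0.72),
    "Finance": (0.18, -0.57, -0.14),
    "Law": (0.50, 0.17, 0.40),
}

# DAS = -MMD, so the DAS correlation is the sign-flipped mean of the MMD row.
das_avg = {
    domain: round(-sum(values) / len(values), 2)
    for domain, values in mmd_corr.items()
}

for domain, value in sorted(das_avg.items(), key=lambda item: -abs(item[1])):
    print(f"{domain:8s} {value:+.2f}")
```

Sorting by magnitude puts Math, Medical, Science, and General first, and leaves Law and Finance at the bottom with a flipped and a weak correlation respectively.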

Figure: DAS Pearson correlation with downstream performance, per domain.

Math +0.86 · Medical +0.77 · Science +0.72 · General Text +0.68 · Law −0.36 · Finance +0.18

Bar magnitude is \(|\rho|\); the displayed value is DAS's Pearson correlation with downstream performance, averaged across Qwen2.5-7B, Llama-3.1-8B, and Mistral-7B-v0.3 (computed as the sign-flipped MMD correlations reported in Tables E & F, since \(\mathrm{DAS} = -\mathrm{MMD}\)). Longer bars in Math, Medical, Science, and General Text indicate that DAS is a strong, consistent predictor there; Finance is weak; in Law the direction flips.

08 Key Findings

More synthetic data is not automatically better

Adding domain-specific synthesized data on top of Dolly-15k frequently reduces downstream performance. Since the protocol changes only the added domain data, these regressions show why data construction must be evaluated end to end.

No construction family wins everywhere

DataFlow-style workflows do well in structured settings such as Law and Finance, while agent-based methods are often competitive or strongest in reasoning-heavy domains. Direct LLM prompting is a useful simple baseline, but its results vary substantially across domains.

Skill-guided construction is strong but selective

Data-Construction-Skill is competitive in knowledge-extraction settings, tying for best on Qwen General and lifting Llama Finance from 15.1 to 34.2. Its weaker Science results suggest room for future skills that better handle open-ended scientific reasoning.

DAS is the most consistent overall quality signal

DAS strongly tracks downstream utility in Math (+0.86), Medical (+0.77), Science (+0.72), and General Text (+0.68), making distributional alignment the most consistent overall signal among the tested metric families.

Existing metrics are narrow or sign-inconsistent

Quality and diversity metrics often work in one domain but weaken or flip sign in another. Finance and Law favor different non-DAS metrics, with Superfiltering strongest in Finance and MTLD strongest in Law, so no metric is uniformly reliable.
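Sign inconsistency can be checked mechanically. The sketch below takes a few Law rows from Table F and flags metrics whose correlation changes sign across the three base models. The `sign_consistent` helper is a hypothetical name, and the check ignores statistical significance, which the tables mark separately.

```python
# Law correlations (Qwen, Llama, Mistral), copied from Table F.
law_corr = {
    "MMD": (0.50, 0.17, 0.40),
    "Qurating-Expertise": (0.72, -0.37, 0.37),
    "Deita-Quality": (0.46, -0.59, 0.24),
    "Superfiltering": (0.52, 0.52, 0.41),
    "MTLD": (0.71, 0.26, 0.59),
}


def sign_consistent(corrs) -> bool:
    """True when every correlation has the same strict sign."""
    return all(c > 0 for c in corrs) or all(c < 0 for c in corrs)


consistency = {name: sign_consistent(c) for name, c in law_corr.items()}
flipping = [name for name, ok in consistency.items() if not ok]
print("Sign flips across base models:", flipping)
```

A consistent sign is necessary but not sufficient: the MMD row is sign-consistent in Law, yet positive, which is the opposite of the theoretically expected direction for a distance.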

DataPrep-Bench makes comparison reproducible

By fixing raw corpora, candidate pools, base models, training recipes, and downstream benchmarks, the benchmark turns data preparation from anecdotal comparison into a measurable testbed for agents, workflows, skills, and metrics.