# Loading Datasets
DeepFabric outputs standard JSONL files that load directly with the HuggingFace `datasets` library.
## From Local File
```python
from datasets import load_dataset

dataset = load_dataset("json", data_files="dataset.jsonl", split="train")
```
## From HuggingFace Hub
Upload first:
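A minimal sketch using the `datasets` library's built-in `push_to_hub` (the repo name is a placeholder, and you need to be authenticated, e.g. via `huggingface-cli login`):

```python
# Push the loaded dataset to the Hub ("your-username/dataset-name" is a placeholder)
dataset.push_to_hub("your-username/dataset-name")
```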
Then load:
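Assuming the placeholder repo name from above:

```python
from datasets import load_dataset

dataset = load_dataset("your-username/dataset-name", split="train")
```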
## Train/Test Split
```python
# 90/10 split; a fixed seed makes the split reproducible
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_ds = splits["train"]
eval_ds = splits["test"]
```
## Accessing Fields
```python
for sample in dataset:
    messages = sample["messages"]
    reasoning = sample.get("reasoning")
    tools = sample.get("tools")

    # Messages structure
    for msg in messages:
        role = msg["role"]  # system, user, assistant, tool
        content = msg["content"]
        tool_calls = msg.get("tool_calls")
```
## Filtering
Keep only samples with tool calls:
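A sketch based on the message structure shown above: a sample counts as tool-using if any of its messages carries `tool_calls`.

```python
def has_tool_calls(sample):
    # True if any message in the conversation includes a tool call
    return any(msg.get("tool_calls") for msg in sample["messages"])

with_tools = dataset.filter(has_tool_calls)
```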
Filter by reasoning style:
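The field name below is hypothetical; substitute whatever style label your generation config actually writes into each sample.

```python
# "reasoning_style" and "chain_of_thought" are placeholder names -- adjust to your schema
cot = dataset.filter(lambda s: s.get("reasoning_style") == "chain_of_thought")
```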
## Shuffling
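A fixed seed keeps the shuffled order reproducible across runs:

```python
dataset = dataset.shuffle(seed=42)
```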
## Streaming Large Datasets
For datasets that don't fit in memory:
```python
dataset = load_dataset(
    "your-username/large-dataset",
    split="train",
    streaming=True,
)

for sample in dataset:
    # Process one at a time
    process(sample)
```
## Combining Datasets
Merge multiple DeepFabric datasets:
```python
from datasets import concatenate_datasets, load_dataset

basic = load_dataset("user/basic-ds", split="train")
agent = load_dataset("user/agent-ds", split="train")

# Concatenate, then shuffle so samples from the two sources interleave
combined = concatenate_datasets([basic, agent])
combined = combined.shuffle(seed=42)
```
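Note that `concatenate_datasets` requires the datasets to share the same features; if one dataset lacks a column the other has (for example `tools`), add or drop that column before concatenating.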