Ingestion SDK
Complete ebook ingestion pipeline with metadata enrichment
The Ingestion SDK provides a complete pipeline for importing ebooks into Colibri. It handles metadata extraction, enrichment from external providers, duplicate detection, and database storage with automatic fuzzy matching to prevent duplicates.
Installation
npm install @colibri-hq/sdk
# or
pnpm add @colibri-hq/sdk
Architecture Overview
The ingestion pipeline follows this flow:
flowchart TD
A[Ebook File] --> B["Extract Metadata<br/>(EPUB, MOBI, PDF parsers)"]
B --> C[ExtractedMetadata]
C --> D["Enrich with Providers<br/>(Optional: 14+ metadata sources)"]
D --> E[Multiple MetadataRecords]
E --> F["Convert & Merge<br/>(Reconciliation & confidence scoring)"]
F --> G[Final ExtractedMetadata]
G --> H["Detect Duplicates<br/>(Checksum, ISBN, fuzzy matching)"]
H --> I["Ingest<br/>(Create Work, Edition, Asset)"]
I --> J[("Database<br/>(PostgreSQL with full-text search)")]
Quick Start
Basic Ingestion
Ingest an ebook without enrichment:
import { ingestWork } from "@colibri-hq/sdk/ingestion";
import { parseEpub } from "@colibri-hq/sdk/ebooks";
// Extract metadata from file
const metadata = await parseEpub(fileBuffer);
// Ingest into database
const result = await ingestWork(database, {
file: fileBuffer,
filename: "gatsby.epub",
metadata: metadata,
userId: "user-123",
});
console.log("Created work:", result.work.id);
console.log("Created edition:", result.edition.id);
console.log("Created asset:", result.asset.id);
Ingestion with Enrichment
Automatically fetch additional metadata from external sources:
import { ingestWork } from "@colibri-hq/sdk/ingestion";
import { globalProviderRegistry } from "@colibri-hq/sdk/metadata";
const result = await ingestWork(database, {
file: fileBuffer,
filename: "gatsby.epub",
metadata: extractedMetadata,
userId: "user-123",
enrich: true, // Enable enrichment
enrichProviders: globalProviderRegistry.getEnabledProviders(),
});
if (result.enrichmentResults) {
console.log("Sources used:", result.enrichmentResults.sources);
console.log("Confidence:", result.enrichmentResults.confidence);
}
Metadata Extraction
Extract metadata from different ebook formats:
EPUB Files
import { parseEpub } from "@colibri-hq/sdk/ebooks";
const metadata = await parseEpub(fileBuffer);
console.log("Title:", metadata.title);
console.log(
"Authors:",
metadata.contributors?.filter((c) => c.roles.includes("aut")),
);
console.log("Cover:", metadata.coverImage);
console.log("ISBN:", metadata.identifiers?.find((i) => i.type === "isbn")?.value);
Extracted data:
- Title, subtitle
- Contributors (authors, editors, translators with MARC roles)
- Publisher
- Publication date
- Language (ISO 639-1)
- Identifiers (ISBN, DOI, etc.)
- Synopsis/description
- Subjects/tags
- Cover image
- Series information
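The contributor and identifier lookups used in the EPUB example can be wrapped in small helpers. The sketch below is self-contained and uses simplified, hypothetical types (`MetadataSketch`, `namesByRole`, `firstIdentifier`) inferred from the examples on this page; the SDK's real `ExtractedMetadata` type may have more fields or different names.

```typescript
// Hypothetical, simplified shapes inferred from the examples above;
// the SDK's real ExtractedMetadata type may differ.
interface Contributor {
  name: string;
  roles: string[]; // MARC relator codes, e.g. "aut", "edt", "trl"
}

interface Identifier {
  type: string;
  value: string;
}

interface MetadataSketch {
  title?: string;
  contributors?: Contributor[];
  identifiers?: Identifier[];
}

// Return the names of all contributors carrying the given MARC relator code.
function namesByRole(metadata: MetadataSketch, role: string): string[] {
  return (metadata.contributors ?? [])
    .filter((contributor) => contributor.roles.includes(role))
    .map((contributor) => contributor.name);
}

// Return the first identifier value of the given type, if any.
function firstIdentifier(metadata: MetadataSketch, type: string): string | undefined {
  return metadata.identifiers?.find((identifier) => identifier.type === type)?.value;
}

const sample: MetadataSketch = {
  title: "The Great Gatsby",
  contributors: [
    { name: "F. Scott Fitzgerald", roles: ["aut"] },
    { name: "Jane Example", roles: ["edt"] }, // made-up editor for illustration
  ],
  identifiers: [{ type: "isbn", value: "9780743273565" }],
};

console.log("Authors:", namesByRole(sample, "aut"));
console.log("ISBN:", firstIdentifier(sample, "isbn"));
```

Both helpers tolerate missing arrays, which matters because most of the extracted fields are optional.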
MOBI Files
import { parseMobi } from "@colibri-hq/sdk/ebooks";
const metadata = await parseMobi(fileBuffer);
console.log("Title:", metadata.title);
console.log("Creator:", metadata.creator);
console.log("ASIN:", metadata.asin);
console.log("Cover:", metadata.coverImage);
Extracted data:
- Title
- Creator (author)
- Publisher
- ASIN (Amazon identifier)
- ISBN
- Language
- Cover image
PDF Files
import { parsePdf } from "@colibri-hq/sdk/ebooks";
const metadata = await parsePdf(fileBuffer);
console.log("Title:", metadata.title);
console.log("Author:", metadata.author);
console.log("Pages:", metadata.pageCount);
Extracted data:
- Title
- Author
- Subject
- Keywords
- Page count
- Creation/modification dates
Metadata Enrichment
Enrich extracted metadata with data from 14+ external providers.
Basic Enrichment
import { enrichMetadata } from "@colibri-hq/sdk/ingestion";
import { globalProviderRegistry } from "@colibri-hq/sdk/metadata";
const ebookMetadata = {
title: "The Great Gatsby",
contributors: [{ name: "F. Scott Fitzgerald", roles: ["aut"] }],
identifiers: [{ type: "isbn", value: "9780743273565" }],
};
// Get all enabled providers
const providers = globalProviderRegistry.getEnabledProviders();
// Enrich metadata
const result = await enrichMetadata(ebookMetadata, providers, {
preferredLanguage: "en",
strategy: "merge-all", // Merge all provider results
timeout: 30000, // 30 second timeout
});
console.log("Enriched:", result.enriched);
console.log("Sources:", result.sources); // ['WikiData', 'OpenLibrary', 'LoC']
console.log("Confidence:", result.confidence); // 0.92
Selective Provider Enrichment
Use specific providers for targeted enrichment:
import { enrichMetadata } from "@colibri-hq/sdk/ingestion";
import {
OpenLibraryMetadataProvider,
WikiDataMetadataProvider,
LibraryOfCongressMetadataProvider,
} from "@colibri-hq/sdk/metadata";
const providers = [
new OpenLibraryMetadataProvider(),
new WikiDataMetadataProvider(),
new LibraryOfCongressMetadataProvider(),
];
const result = await enrichMetadata(ebookMetadata, providers);
Search Strategies
Different search criteria yield different confidence levels:
Best for ISBN (0.90-0.98 confidence):
- Strategy: Parallel
- Providers: WikiData + Library of Congress
- Speed: Fast (1-2 seconds)
Best for Title + Author (0.80-0.95 confidence):
- Strategy: Parallel
- Providers: WikiData + LoC + Open Library
- Speed: Moderate (2-4 seconds)
Best for Title only (0.70-0.85 confidence):
- Strategy: Sequential
- Providers: Open Library → WikiData → LoC
- Speed: Slower (4-8 seconds)
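The recommendations above amount to a small decision table. The following sketch encodes it as a standalone function; `planSearch`, `SearchCriteria`, and `SearchPlan` are hypothetical names for illustration and are not part of the SDK. The provider names are plain strings taken from the lists above.

```typescript
type SearchStrategy = "parallel" | "sequential";

interface SearchCriteria {
  isbn?: string;
  title?: string;
  author?: string;
}

interface SearchPlan {
  strategy: SearchStrategy;
  providers: string[]; // in query order (order only matters for "sequential")
}

// Pick a strategy from the most specific criterion available.
// Returns undefined when there is nothing to search by.
function planSearch(criteria: SearchCriteria): SearchPlan | undefined {
  if (criteria.isbn) {
    return { strategy: "parallel", providers: ["WikiData", "Library of Congress"] };
  }
  if (criteria.title && criteria.author) {
    return {
      strategy: "parallel",
      providers: ["WikiData", "Library of Congress", "Open Library"],
    };
  }
  if (criteria.title) {
    return {
      strategy: "sequential",
      providers: ["Open Library", "WikiData", "Library of Congress"],
    };
  }
  return undefined;
}

console.log(planSearch({ isbn: "9780743273565" }));
console.log(planSearch({ title: "The Great Gatsby" }));
```

An ISBN wins over title and author when both are present, since it is the criterion with the highest expected confidence.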
Duplicate Detection
The ingestion system automatically detects duplicates using multiple strategies.
Detection Methods
1. Exact Asset Match
Compares file checksums (SHA-256):
import { detectDuplicates } from "@colibri-hq/sdk/ingestion";
const duplicates = await detectDuplicates(database, {
checksum: "sha256-hash-of-file",
metadata: extractedMetadata,
});
if (duplicates.exactMatch) {
console.log("Identical file already exists!");
console.log("Confidence:", duplicates.exactMatch.confidence); // 1.0
}
2. ISBN Match
Compares ISBN-10 or ISBN-13:
const duplicates = await detectDuplicates(database, {
metadata: { identifiers: [{ type: "isbn", value: "9780743273565" }] },
});
if (duplicates.isbnMatches.length > 0) {
console.log("Same ISBN found:", duplicates.isbnMatches);
// Could be different editions (hardcover vs paperback)
}
3. Fuzzy Title/Author Match
Uses Levenshtein distance and PostgreSQL trigram similarity:
const duplicates = await detectDuplicates(database, {
metadata: {
title: "The Great Gatsby",
contributors: [{ name: "F Scott Fitzgerald", roles: ["aut"] }],
},
});
for (const match of duplicates.fuzzyMatches) {
console.log(`Title similarity: ${match.titleSimilarity}`);
console.log(`Author similarity: ${match.authorSimilarity}`);
console.log(`Overall confidence: ${match.confidence}`);
}
Duplicate Handling Strategies
Configure how to handle duplicates during ingestion:
const result = await ingestWork(database, {
file: fileBuffer,
metadata: extractedMetadata,
userId: "user-123",
onDuplicateWork: "prompt", // Options: 'skip' | 'replace' | 'prompt'
onDuplicateEdition: "add-edition", // Add as new edition of existing work
onDuplicateAsset: "skip", // Skip if exact file exists
});
if (result.status === "needs-confirmation") {
console.log("Duplicate found:", result.duplicateInfo);
// Display confirmation dialog to user
}
Normalization
The SDK automatically normalizes names to prevent duplicates and improve matching.
Creator Name Normalization
Handles initials, titles, suffixes, and punctuation:
import { normalizeCreatorName } from "@colibri-hq/sdk/ingestion";
normalizeCreatorName("J.K. Rowling"); // "jk rowling"
normalizeCreatorName("J. K. Rowling"); // "jk rowling"
normalizeCreatorName("Rowling, J.K."); // "jk rowling"
normalizeCreatorName("Dr. John Smith"); // "john smith"
normalizeCreatorName("Martin Luther King Jr."); // "martin luther king"
normalizeCreatorName("José García"); // "jose garcia"
Supported transformations:
- Titles: Dr, Prof, Sir, Dame, Lord, Lady, Rev, Father, Mother, Brother, Sister, Saint, St, Pope
- Suffixes: Jr, Sr, Junior, Senior, I-XV (Roman numerals), PhD, MD, Esq
- Accents and diacritics normalized to ASCII
- Apostrophes removed (O’Brien → obrien)
- Hyphens preserved in hyphenated names
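To make the transformations above concrete, here is a minimal standalone sketch of how such a normalizer could work. It is not the SDK's implementation: `normalizeCreatorNameSketch` is a hypothetical name, the title and suffix lists are shortened, and Roman-numeral suffixes are not handled.

```typescript
// Shortened, illustrative lists; the lists above are longer.
const TITLES = new Set(["dr", "prof", "sir", "dame", "lord", "lady", "rev"]);
const SUFFIXES = new Set(["jr", "sr", "junior", "senior", "phd", "md", "esq"]);

function normalizeCreatorNameSketch(name: string): string {
  // Strip diacritics: decompose, then drop combining marks.
  let value = name.normalize("NFD").replace(/[\u0300-\u036f]/g, "").toLowerCase();

  // Remove apostrophes (straight and curly) and turn periods into spaces.
  value = value.replace(/['\u2019]/g, "").replace(/\./g, " ");

  // Reorder "last, first" into "first last", unless the part after the
  // comma is just a suffix (as in "king, jr").
  const parts = value.split(",").map((part) => part.trim()).filter(Boolean);
  if (parts.length === 2 && !SUFFIXES.has(parts[1])) {
    value = `${parts[1]} ${parts[0]}`;
  } else {
    value = parts.join(" ");
  }

  // Drop leading titles and trailing suffixes, but never the last token.
  let tokens = value.split(/\s+/).filter(Boolean);
  while (tokens.length > 1 && TITLES.has(tokens[0])) tokens = tokens.slice(1);
  while (tokens.length > 1 && SUFFIXES.has(tokens[tokens.length - 1])) {
    tokens = tokens.slice(0, -1);
  }

  // Merge runs of single-letter initials: ["j", "k"] becomes "jk".
  const merged: string[] = [];
  let previousWasInitial = false;
  for (const token of tokens) {
    const isInitial = token.length === 1;
    if (isInitial && previousWasInitial) {
      merged[merged.length - 1] += token;
    } else {
      merged.push(token);
    }
    previousWasInitial = isInitial;
  }
  return merged.join(" ");
}

console.log(normalizeCreatorNameSketch("Rowling, J.K."));
```

Because hyphens are never touched, hyphenated names pass through unchanged apart from lowercasing.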
Publisher Name Normalization
Removes business suffixes and common words:
import { normalizePublisherName } from "@colibri-hq/sdk/ingestion";
normalizePublisherName("Penguin Books Ltd."); // "penguin"
normalizePublisherName("The Penguin Press"); // "penguin"
normalizePublisherName("O'Reilly Media"); // "oreilly media"
normalizePublisherName("McGraw-Hill"); // "mcgraw-hill"
Supported transformations:
- Business suffixes: Ltd, LLC, Inc, Corp, Co, Company
- Publishing words: Publishing, Publishers, Book(s), Press, Group
- Geographic terms: International, Worldwide
- Parentheticals removed (e.g., “Penguin (US)” → “penguin”)
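The publisher rules above can be approximated with a stop-word filter. The sketch below is an illustration under assumptions, not the SDK's code: `normalizePublisherNameSketch` is a hypothetical name, and treating the leading article "the" as a removable word is inferred from the "The Penguin Press" example rather than stated in the list.

```typescript
// Words dropped from publisher names. "the" is an assumption based on
// the "The Penguin Press" example; the rest come from the list above.
const PUBLISHER_STOP_WORDS = new Set([
  "the",
  "ltd", "llc", "inc", "corp", "co", "company",
  "publishing", "publishers", "book", "books", "press", "group",
  "international", "worldwide",
]);

function normalizePublisherNameSketch(name: string): string {
  const cleaned = name
    .toLowerCase()
    .replace(/\([^)]*\)/g, " ") // drop parentheticals such as "(US)"
    .replace(/['\u2019]/g, "") // remove straight and curly apostrophes
    .replace(/[.,]/g, " "); // periods and commas become separators

  const tokens = cleaned.split(/\s+/).filter(Boolean);
  const kept = tokens.filter((token) => !PUBLISHER_STOP_WORDS.has(token));

  // If every token was a stop word, keep the original tokens
  // rather than returning an empty string.
  return (kept.length > 0 ? kept : tokens).join(" ");
}

console.log(normalizePublisherNameSketch("Penguin Books Ltd."));
```

The fallback in the last step is a design choice of this sketch: a publisher literally named "The Press" would otherwise normalize to nothing and match every other empty name.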
Fuzzy Matching
Find similar creators or publishers in the database:
import { findSimilarCreators, findSimilarPublishers } from "@colibri-hq/sdk";
// Find creators with similar names (default threshold: 70%)
const similar = await findSimilarCreators(database, "J. K. Rowling", 0.7);
// Returns: [{ creator: { name: "J.K. Rowling", ... }, similarity: 0.92 }]
// Find publishers with similar names
const matches = await findSimilarPublishers(database, "Penguin Books", 0.7);
// Returns: [{ publisher: { name: "Penguin", ... }, similarity: 0.87 }]
During ingestion, fuzzy matching is applied automatically to prevent duplicate creators and publishers.
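For intuition about what a similarity score against a threshold such as 0.7 means, the sketch below computes a Levenshtein-based similarity between two strings in plain TypeScript. It covers only the edit-distance idea; PostgreSQL trigram similarity is a different measure computed in the database, and how the SDK combines the two is not shown here. The function names are hypothetical.

```typescript
// Classic dynamic-programming Levenshtein distance, keeping two rows.
function levenshtein(a: string, b: string): number {
  // previous[j] = distance between the first i-1 chars of a and the first j chars of b
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const current = [i]; // distance from the first i chars of a to the empty string
    for (let j = 1; j <= b.length; j++) {
      const substitutionCost = a[i - 1] === b[j - 1] ? 0 : 1;
      current[j] = Math.min(
        previous[j] + 1, // deletion
        current[j - 1] + 1, // insertion
        previous[j - 1] + substitutionCost, // substitution or match
      );
    }
    previous = current;
  }
  return previous[b.length];
}

// Scale the distance into a 0..1 similarity, where 1 means identical.
function editSimilarity(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

const score = editSimilarity("jk rowling", "j k rowling");
console.log(score >= 0.7 ? "likely the same creator" : "probably different");
```

Comparing normalized names rather than raw input usually gives more stable scores, since punctuation and casing differences no longer count as edits.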
Confidence Scoring
Confidence scores indicate the reliability of enriched metadata.
Confidence Tiers
| Range | Tier | Meaning |
|---|---|---|
| 0.95-1.00 | Exceptional | Very high confidence, multiple authoritative sources |
| 0.90-0.95 | Strong | High confidence, reliable data |
| 0.70-0.90 | Good | Good confidence, likely accurate |
| 0.50-0.70 | Moderate | Moderate confidence, may need verification |
| 0.30-0.50 | Weak | Low confidence, verification recommended |
| 0.00-0.30 | Poor | Very low confidence, likely unreliable |
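A score can be mapped to its tier with a simple lookup. The table's ranges share their boundary values, so this sketch assumes each lower bound is inclusive (0.95 counts as Exceptional, 0.90 as Strong); `confidenceTier` is a hypothetical helper, not an SDK export.

```typescript
type ConfidenceTier = "Exceptional" | "Strong" | "Good" | "Moderate" | "Weak" | "Poor";

// Lower bounds in descending order. Treating them as inclusive is an
// assumption, since the table's ranges overlap at their edges.
const TIER_LOWER_BOUNDS: Array<[number, ConfidenceTier]> = [
  [0.95, "Exceptional"],
  [0.9, "Strong"],
  [0.7, "Good"],
  [0.5, "Moderate"],
  [0.3, "Weak"],
  [0, "Poor"],
];

function confidenceTier(score: number): ConfidenceTier {
  if (!Number.isFinite(score) || score < 0 || score > 1) {
    throw new RangeError(`Confidence must be between 0 and 1, got ${score}`);
  }
  for (const [lowerBound, tier] of TIER_LOWER_BOUNDS) {
    if (score >= lowerBound) return tier;
  }
  return "Poor"; // unreachable for valid input; keeps the return type total
}

console.log(confidenceTier(0.92));
```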
Confidence Calculation
Confidence scores are calculated based on:
- Base Confidence: Provider’s initial confidence (0.5-0.9)
- Source Count Boost: +0.05 per additional source (max +0.15)
- Agreement Boost: +0.10 when sources agree on values
- Reliability Boost: +0.08 for high-quality providers
- Disagreement Penalty: -0.10 to -0.20 for conflicting data
Example:
WikiData alone: 0.90
+ LoC agreeing: 0.95 (+0.05 source boost)
+ OpenLib agreeing: 0.98 (+0.03 source boost, +0.05 agreement boost)
Advanced Usage
Conflict Resolution
Handle conflicting metadata from different providers:
import { resolveConflict } from "@colibri-hq/sdk/ingestion";
// Multiple providers return different publication dates
const allRecords = [
{ ...wikidataRecord, publicationDate: new Date("2020-01-15") },
{ ...locRecord, publicationDate: new Date("2020-01-01") },
{ ...openlibRecord, publicationDate: new Date("2019-12-31") },
];
// Resolve conflict
const resolution = resolveConflict(allRecords, "publicationDate", (date) =>
date.getFullYear().toString(),
);
if (resolution.hasConflict) {
console.log("Conflict detected!");
console.log("Chosen value:", resolution.value);
console.log("Alternatives:", resolution.alternatives);
}
Series Detection
Automatically detect and create series:
import { findOrCreateSeries, addWorkToSeries } from "@colibri-hq/sdk";
// Extract series from metadata
if (metadata.series) {
const series = await findOrCreateSeries(database, metadata.series.name, {
language: metadata.language || "en",
userId: userId,
});
await addWorkToSeries(database, workId, series.id, metadata.series.position);
}
Tag Extraction
Extract and link tags from subjects:
import { findOrCreateTags, addTagsToWork } from "@colibri-hq/sdk";
// Extract subjects from metadata
const subjects = metadata.subjects || [];
// BISAC subjects are automatically parsed
// Input: "FICTION / Romance / Historical / Scottish"
// Output: ["fiction", "romance", "historical", "scottish"]
const tags = await findOrCreateTags(database, subjects, { userId });
await addTagsToWork(
database,
workId,
tags.map(({ id }) => id),
);
Custom Metadata Conversion
Convert provider metadata to ingestion format:
import { convertToExtractedMetadata, mergeMetadataRecords } from "@colibri-hq/sdk/ingestion";
import { OpenLibraryMetadataProvider, WikiDataMetadataProvider } from "@colibri-hq/sdk/metadata";
// Query providers manually
const wikidata = new WikiDataMetadataProvider();
const openlib = new OpenLibraryMetadataProvider();
const [wikidataResults, openlibResults] = await Promise.all([
wikidata.searchByISBN("9780743273565"),
openlib.searchByISBN("9780743273565"),
]);
// Merge results
const merged = mergeMetadataRecords([...wikidataResults, ...openlibResults]);
console.log("Merged metadata:", merged);
Error Handling
The enrichment system handles errors gracefully:
Provider Failures
// Provider failures don't stop enrichment
const result = await enrichMetadata(metadata, providers);
// If WikiData fails, other providers still run
// If all providers fail, returns empty enrichment
console.log("Errors:", result.errors);
// Map of provider name to error
Ingestion Failures
// Enrichment is best-effort: if providers are unavailable, the work is
// still created from the ebook's own metadata
try {
  const result = await ingestWork(database, {
    file,
    metadata,
    enrich: true, // Enrichment is best-effort
  });
} catch (error) {
  // Reached when ingestion itself fails, not when enrichment fails
  console.error("Ingestion failed:", error);
}
Performance Considerations
Caching
Enable provider caching to reduce API calls:
import { CacheableMetadataProvider } from "@colibri-hq/sdk/metadata";
const cached = new CacheableMetadataProvider(provider, {
maxSize: 1000,
defaultTtl: 300000, // 5 minutes
});
Rate Limiting
Providers automatically handle rate limiting:
- WikiData: 60 requests/min, 1s delay
- Library of Congress: 30 requests/min, 2s delay
- Open Library: 100 requests/min, 200ms delay
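The per-provider delays above can be pictured as a minimum-spacing limiter. The sketch below implements only the delay part, not the per-minute quota, and takes the current time as a parameter so that it stays deterministic. `MinDelayLimiter` and `PROVIDER_DELAYS_MS` are hypothetical names, not SDK exports.

```typescript
// Minimum spacing between requests, taken from the list above.
const PROVIDER_DELAYS_MS: Record<string, number> = {
  WikiData: 1000,
  "Library of Congress": 2000,
  "Open Library": 200,
};

class MinDelayLimiter {
  private readonly minDelayMs: number;
  private nextSlot = 0; // earliest timestamp (ms) at which the next request may start

  constructor(minDelayMs: number) {
    this.minDelayMs = minDelayMs;
  }

  // Reserve the next request slot and return how many milliseconds
  // the caller should wait before sending the request.
  reserve(now: number): number {
    const start = Math.max(now, this.nextSlot);
    this.nextSlot = start + this.minDelayMs;
    return start - now;
  }
}

const wikidataLimiter = new MinDelayLimiter(PROVIDER_DELAYS_MS["WikiData"]);
console.log(wikidataLimiter.reserve(0)); // first request, no wait
console.log(wikidataLimiter.reserve(100)); // too soon, must wait for the next slot
```

In real use, a caller would pass `Date.now()` and sleep for the returned number of milliseconds before issuing the request.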
Timeouts
Set appropriate timeouts to prevent hanging:
const result = await enrichMetadata(metadata, providers, {
timeout: 30000, // 30 seconds total
});
Parallel vs Sequential
- Parallel: Faster, use for ISBN or Title+Author searches
- Sequential: Slower but more reliable, use for title-only searches
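The difference between the two modes can be sketched with plain promises. This is one plausible reading, not the SDK's implementation: here "parallel" starts every lookup at once and keeps whatever succeeds, while "sequential" tries lookups in order and stops at the first non-empty result. `Lookup`, `queryParallel`, and `querySequential` are hypothetical names.

```typescript
// A lookup is any function that asynchronously returns a list of results.
type Lookup<T> = () => Promise<T[]>;

// Parallel: start every lookup at once and keep the results of those
// that succeed; failures are ignored.
async function queryParallel<T>(lookups: Lookup<T>[]): Promise<T[]> {
  const settled = await Promise.allSettled(lookups.map((lookup) => lookup()));
  return settled.flatMap((outcome) =>
    outcome.status === "fulfilled" ? outcome.value : [],
  );
}

// Sequential: try lookups in order and stop at the first non-empty result.
async function querySequential<T>(lookups: Lookup<T>[]): Promise<T[]> {
  for (const lookup of lookups) {
    try {
      const results = await lookup();
      if (results.length > 0) return results;
    } catch {
      // A failing provider does not abort the chain; try the next one.
    }
  }
  return [];
}

// Stand-in lookups for illustration only.
const found: Lookup<string> = async () => ["record-a"];
const nothing: Lookup<string> = async () => [];
const failing: Lookup<string> = async () => {
  throw new Error("provider unavailable");
};

queryParallel<string>([found, failing, nothing]).then((records) =>
  console.log("parallel:", records),
);
querySequential<string>([failing, nothing, found]).then((records) =>
  console.log("sequential:", records),
);
```

Under this reading, sequential mode sends fewer requests when an early provider already has an answer, at the cost of waiting for each provider in turn.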
Database Schema
The ingestion system creates these entities:
Work
The abstract book entity:
interface Work {
id: string;
title: string;
titleSortKey: string; // For alphabetical sorting
synopsis?: string;
language?: string;
datePublished?: Date;
createdAt: Date;
updatedAt: Date;
}
Edition
A specific published edition of a work:
interface Edition {
id: string;
workId: string;
title?: string; // Edition-specific title
publishedDate?: Date;
publisher?: string;
numberOfPages?: number;
format?: string; // 'epub' | 'mobi' | 'pdf'
createdAt: Date;
}
Asset
The physical ebook file:
interface Asset {
id: string;
editionId: string;
filename: string;
mimetype: string;
checksum: string; // SHA-256
size: number;
storageKey: string; // S3 key
createdAt: Date;
}
Relationships
Work ─┬─ Edition ── Asset
├─ Contribution ── Creator
├─ CollectionItem ── Collection
├─ WorkTag ── Tag
├─ SeriesItem ── Series
└─ Comment
TypeScript Support
The SDK is fully typed:
import type {
ExtractedMetadata,
IngestWorkOptions,
IngestWorkResult,
DuplicateCheckResult,
EnrichmentResult,
} from "@colibri-hq/sdk/ingestion";
Complete Example
Full ingestion pipeline with enrichment and duplicate handling:
import { ingestWork, detectDuplicates } from "@colibri-hq/sdk/ingestion";
import { parseEpub } from "@colibri-hq/sdk/ebooks";
import { globalProviderRegistry } from "@colibri-hq/sdk/metadata";
// Assumes `database` and a `calculateChecksum` helper are defined elsewhere in the application
async function importEbook(file: File, userId: string) {
// 1. Extract metadata from file
const fileBuffer = await file.arrayBuffer();
const metadata = await parseEpub(fileBuffer);
console.log("Extracted:", metadata.title);
// 2. Check for duplicates
const duplicates = await detectDuplicates(database, {
metadata,
checksum: await calculateChecksum(fileBuffer),
});
if (duplicates.exactMatch) {
console.log("File already exists, skipping");
return { status: "skipped", reason: "duplicate-file" };
}
// 3. Ingest with enrichment
const providers = globalProviderRegistry.getEnabledProviders();
const result = await ingestWork(database, {
file: fileBuffer,
filename: file.name,
metadata,
userId,
enrich: true,
enrichProviders: providers,
onDuplicateWork: "prompt",
onDuplicateEdition: "add-edition",
});
if (result.status === "needs-confirmation") {
// Display confirmation dialog
return result;
}
console.log("Imported:", result.work.title);
console.log("Confidence:", result.enrichmentResults?.confidence);
return result;
}
Related Documentation
- Metadata SDK - Metadata provider system
- SDK Overview - Core SDK features
- MOBI Parser - MOBI format details