
Algroveon-Parser – RSS and Atom Parser without external dependencies

Slim feed parser in pure Python for RSS 2.0, RSS 0.91, and Atom 1.0.

This Python library parses RSS and Atom feeds using only the Python standard library. It processes raw feed data without external dependencies. I wrote it as a direct replacement for an external parser because I wanted to see how much I could achieve on my own while keeping external dependencies to an absolute minimum.

What the Parser Does

  • Formats: RSS 2.0, RSS 0.91, Atom 1.0, RDF-based RSS 1.0 feeds
  • Output: Typed Feed and Entry dataclasses, ready for immediate use
  • HTML Sanitizer: Allowlist-based, XSS-secure, produces cleaned HTML and plain text
  • Image Extraction: media:thumbnail → media:content → first <img> from content → summary
  • Date Normalization: RFC 2822 and ISO 8601, always timezone-aware
  • Encoding Fallback: Tolerates feeds that declare incorrect encoding
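
The date normalization named in the list above can be approximated with the standard library alone. The following is a minimal sketch, not the project's actual code: the function name normalize_date, the choice to convert everything to UTC, and the choice to treat timestamps without an offset as UTC are all assumptions made for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def normalize_date(raw: str) -> Optional[datetime]:
    """Return a timezone-aware UTC datetime, or None if the value cannot be parsed.

    Hypothetical helper for illustration only.
    """
    text = raw.strip()
    if not text:
        return None

    parsed = None
    try:
        # RFC 2822, e.g. "Tue, 03 Mar 2026 14:30:00 +0100" (RSS pubDate).
        parsed = parsedate_to_datetime(text)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        pass

    if parsed is None:
        try:
            # ISO 8601, e.g. "2026-03-03T14:30:00Z" (Atom updated, dc:date).
            # Older fromisoformat() versions reject a trailing "Z".
            iso = text[:-1] + "+00:00" if text.endswith(("Z", "z")) else text
            parsed = datetime.fromisoformat(iso)
        except ValueError:
            return None

    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


rss_date = normalize_date("Tue, 03 Mar 2026 14:30:00 +0100")
atom_date = normalize_date("2026-03-03T14:30:00Z")
```

Trying RFC 2822 first and ISO 8601 second means a single function can serve both RSS and Atom entries; a value that matches neither format yields None rather than an exception.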

Modules

Module        Task
parser.py     Format detection, dispatcher, encoding fallback
rss2.py       RSS 2.0 / 0.91 parser including content:encoded, dc:creator, media:*
atom.py       Atom 1.0 parser including <link rel="alternate">
sanitize.py   HTML sanitizer + plain-text extraction
images.py     Image URL extraction from XML elements and HTML content
date.py       RFC 2822 and ISO 8601 date normalization
models.py     Feed and Entry dataclasses
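
To illustrate what an allowlist-based sanitizer with plain-text extraction (the task listed for sanitize.py) can look like using only html.parser, here is a rough sketch. It is not the module's real implementation: the class name AllowlistSanitizer, the function sanitize, and the specific tag, attribute, and URL-scheme allowlists are hypothetical choices.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: illustrative allowlists, not the project's actual ones.
ALLOWED_TAGS = {"p", "br", "a", "b", "i", "em", "strong", "ul", "ol", "li", "img"}
ALLOWED_ATTRS = {"a": {"href"}, "img": {"src", "alt"}}
VOID_TAGS = {"br", "img"}
DROP_CONTENT = {"script", "style"}  # drop these tags together with their content
SAFE_SCHEMES = ("http://", "https://")


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.html_parts = []
        self.text_parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self._skip_depth += 1
            return
        if self._skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tags are dropped, their text content is kept
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, ()):
                continue  # removes onclick, style, and similar attributes
            if name in ("href", "src") and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue  # removes javascript: and other non-HTTP URLs
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.html_parts.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self._skip_depth = max(0, self._skip_depth - 1)
            return
        if not self._skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.html_parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_depth:
            return
        self.html_parts.append(escape(data, quote=False))
        self.text_parts.append(data)


def sanitize(html_text):
    """Return (cleaned_html, plain_text) for an HTML fragment."""
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    clean_html = "".join(parser.html_parts)
    plain_text = " ".join("".join(parser.text_parts).split())
    return clean_html, plain_text


clean_html, plain_text = sanitize(
    '<p onclick="x()">Hi <a href="javascript:alert(1)">there</a>'
    "<script>alert(1)</script></p>"
)
```

Because everything not explicitly allowed is discarded, new or unknown tags and attributes are removed by default. This sketch does not rebalance unmatched closing tags and drops relative URLs, both of which a production sanitizer would need to handle deliberately.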

Namespaces and Real-World Feeds

Developed and tested against 16 real feeds (as of March 2026): Tagesschau, Spiegel, Süddeutsche, Zeit, Heise, The Verge, Handelsblatt, WiWo, Postillon, and others. Supported namespaced elements: content:encoded, dc:creator, dc:date, media:thumbnail, media:content. This scope will be expanded significantly, but it already serves as a good training ground for making the parser work as universally as possible across technologies and feed variants in the long term.
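
As a rough illustration of how these namespaced elements can be read with xml.etree.ElementTree, the sketch below resolves the prefixes through a namespace map and applies an image fallback chain (media:thumbnail, then media:content, then the first <img> in the content, then the summary). The helper names extract_item and first_img, the returned dictionary, and the regular expression are assumptions for this example, not the project's API.

```python
import re
import xml.etree.ElementTree as ET

# Well-known namespace URIs for the prefixes named above.
NS = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "media": "http://search.yahoo.com/mrss/",
}

# Assumption: a simple regex is enough to find the first <img src="..."> here.
IMG_SRC = re.compile(r'<img\b[^>]*?\bsrc=["\']([^"\']+)["\']', re.IGNORECASE)


def first_img(html_text):
    match = IMG_SRC.search(html_text or "")
    return match.group(1) if match else None


def extract_item(item):
    """Hypothetical helper: pull author, content, and an image URL from an RSS <item>."""
    content = item.findtext("content:encoded", default="", namespaces=NS)
    summary = item.findtext("description", default="")
    author = item.findtext("dc:creator", default=None, namespaces=NS)

    image = None
    for tag in ("media:thumbnail", "media:content"):
        element = item.find(tag, NS)
        if element is not None and element.get("url"):
            image = element.get("url")
            break
    # Fall back to the first <img> in the full content, then in the summary.
    image = image or first_img(content) or first_img(summary)
    return {"author": author, "image": image, "content": content, "summary": summary}


SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>Example</title>
      <dc:creator>Jane Doe</dc:creator>
      <description>Short teaser</description>
      <content:encoded><![CDATA[<p><img src="https://example.org/a.jpg"> Text</p>]]></content:encoded>
    </item>
  </channel>
</rss>"""

item = ET.fromstring(SAMPLE).find("channel/item")
result = extract_item(item)
```

The sketch does not check whether a media:content element actually refers to an image rather than video or audio; real feeds would need that check.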

No HTTP client included – intentionally. The parser accepts raw bytes, making it independent of the transport layer.
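
Working on raw bytes also makes it possible to recover from a wrong encoding declaration before any feed-specific parsing starts. The following sketch shows one way to do that together with a simple format detection on the root element. The names parse_xml and detect_format, the returned labels, and the fallback order (UTF-8, then Latin-1) are assumptions for illustration, not the behavior of parser.py.

```python
import re
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"
RDF_NS = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
XML_DECL = re.compile(rb"^\s*<\?xml[^>]*\?>")


def parse_xml(raw: bytes) -> ET.Element:
    """Parse feed bytes; if the declared encoding is wrong, decode manually."""
    try:
        return ET.fromstring(raw)
    except ET.ParseError:
        # Drop the (apparently incorrect) XML declaration and decode ourselves.
        body = XML_DECL.sub(b"", raw, count=1)
        try:
            text = body.decode("utf-8")
        except UnicodeDecodeError:
            # Assumption: Latin-1 as last resort; it can decode any byte sequence.
            text = body.decode("latin-1")
        # A genuinely malformed document still raises ParseError here.
        return ET.fromstring(text)


def detect_format(root: ET.Element) -> str:
    """Classify the feed by its root element (labels are hypothetical)."""
    if root.tag == "rss":
        return "rss"
    if root.tag == ATOM_NS + "feed":
        return "atom"
    if root.tag == RDF_NS + "RDF":
        return "rdf"
    raise ValueError(f"Unknown feed format: {root.tag}")


# Declares UTF-8 but actually contains a Latin-1 encoded "ü" (byte 0xFC).
broken = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<rss version="2.0"><channel><title>Z\xfcrich</title></channel></rss>'
)
root = parse_xml(broken)
feed_format = detect_format(root)
title = root.findtext("channel/title")
```

The caller supplies the bytes from wherever they come from (HTTP response, file, cache). This fallback only catches declarations that make the XML parser fail; a feed that declares Latin-1 but contains UTF-8 parses without error and would need a separate heuristic.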

Repository Status
sebmeisinger / algroveon-parser

Running embedded within the Algroveon news infrastructure. Packaging as a standalone, cleanly distributable module is not yet complete.

[SYS] Extracting from monorepo
Awaiting Refactor