Roaster
Large Scale Article Extract of Newspapers 1730s-1960s

Hello HN, over the past 7 months I've spent nearly 3,000 hours building SNEWPAPERS, the first historical newspaper archive with full-text extractions, nearly perfect OCR, a vast categorization taxonomy and, of course, semantic and agentic search capabilities.

Problem: I wanted to search through newspaper archives, but every service I tried only lets you search by keyword and date, then gives you back raw images of the papers, too many of them and with no context. A sea of noise.

Solution: I taught machines how to read the newspapers, and so far I've extracted the content from more than 600k pages (about 5TB) from the Chronicling America collection.

The problems I had to deal with were an infinite variety of layouts, font sizes, image scan qualities, resolutions and aspect ratios, plus navigating around the images on the page. I also had to figure out how to get OCR to be nearly perfect so people wouldn't hate reading the extracts. I stitched together a multi-model pipeline (layout tech, ocr tech, llm, vllm) with heuristics to go from layout -> segmentation -> classification. I put it all in OpenSearch / Postgres, made it semantically searchable, and put an agentic search tool on top that knows how to use the API really well and helps you write queries to find what you're looking for.

Happy to discuss AWS architecture and scaling as well, that was tough!
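To make the layout -> segmentation -> classification flow mentioned above a bit more concrete, here is a minimal sketch in Python. It is not the SNEWPAPERS implementation: `Region`, `Article`, `segment`, `run_pipeline`, the `column_width` parameter and the "headline"/"body" labels are all hypothetical names chosen for illustration, and the `ocr` and `classify` callables are stand-ins for whatever OCR and LLM models a real pipeline would call. The segmentation heuristic shown (read columns left to right, start a new article at each headline) is only one simple possibility.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Region:
    """A block found by a (hypothetical) layout model, in pixel coordinates."""
    x: int
    y: int
    kind: str       # assumed labels: "headline" or "body"
    text: str = ""  # filled in by the OCR stage


@dataclass
class Article:
    regions: List[Region] = field(default_factory=list)
    category: str = "unknown"

    @property
    def text(self) -> str:
        return "\n".join(r.text for r in self.regions)


def segment(regions: List[Region], column_width: int) -> List[Article]:
    """Heuristic segmentation: read columns left to right, top to bottom,
    and start a new article every time a headline is encountered."""
    ordered = sorted(regions, key=lambda r: (r.x // column_width, r.y))
    articles: List[Article] = []
    for region in ordered:
        if region.kind == "headline" or not articles:
            articles.append(Article())
        articles[-1].regions.append(region)
    return articles


def run_pipeline(
    regions: List[Region],
    ocr: Callable[[Region], str],
    classify: Callable[[str], str],
    column_width: int = 110,
) -> List[Article]:
    """Layout regions in, categorized articles out."""
    for region in regions:
        region.text = ocr(region)                  # OCR stage (stand-in)
    articles = segment(regions, column_width)      # segmentation stage
    for article in articles:
        article.category = classify(article.text)  # classification stage (stand-in)
    return articles
```

Passing the models in as plain callables keeps the sketch runnable without any OCR or LLM dependency; in a real system each stage would likely be a separate service with its own retries and quality checks.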
If you have five minutes and you just want to jump in and have your own personalized experience, what I would suggest is:

1. Before searching for anything, go to the Sleuth page.
2. Ask it about anything from 1736 to 1963, maybe 1 or 2 follow-up questions.
3. Then go to the search page so you can see the queries it wrote for you (bottom left, "saved queries") and uncover more info on whatever it is you're interested in.

If you think it's cool and you want to learn more, then there's about 10 minutes of video guides on the various capabilities in "Guide" on the nav bar.

Some other people have also taken a crack at this, notably:

https://dell-research-harvard.github.io/resources/americanst... (very good attempt)
https://labs.loc.gov/work/experiments/newspaper-navigator/ (focused on images)

Developer Tools BOTH · brettnbutter
Revenue not available

Similar Products

Developer Tools
Capgo

Instant updates for Capacitor apps. Ship fixes in minutes, not weeks. Push OTA updates to users without app store delays.

$15.2K /mo
Developer Tools Easy to clone
OpenAlternative

Open source alternatives to popular software. Over 1 million users replaced their proprietary tools with open source software. Discover the best alternatives and join the movement.

$6.7K /mo
Developer Tools
Garden of Flowers

Hey all, I made this. The archive started with my 2015 BA thesis on Amiga ASCII art when I was curious about the history of ASCII art but found very little on text art that came before it. The historical precursors are often attributed to typewriter art and shaped/visual poetry, but I think letterpress is overlooked. So, I got slightly obsessed and started a personal database of pictures built entirely from metal type, ornaments, and rule, some going back to the 1600s. After eight years, I've managed to find ~2500 images. My friend Adel Faure built the website so it's now browseable by anyone! I would like to note that most images are from public digital collections (Internet Archive, national libraries, etc.) and displayed without permission (for educational purposes). I've tried to source every image, but check the original source and its license before reusing anything. I'd be happy to take down or correct anything. It's also incomplete and surely has errors and misattributions. Corrections to anything are very welcome. If anyone has leads on works I haven't catalogued, I'd love to hear them! The practice and pictures are scattered across languages and keywords (type picture, typosignet, typotectur, Bildsatz, stigmatypie, stunt typography...), so things hide in odd corners of archives. If you've seen something like this, please point me at it. There's also a longer essay on how it began: https://garden-of-flowers.heikkilotvonen.com/?essay

Revenue N/A
Developer Tools
I built 80 mini-games using Fable before it was shut down

Dear Hacker News, I'm kindly asking for your participation in the open beta for my AI-managed mini-games website. Thank you in advance! For a limited time window, I'm setting the all-free feature flag to true. I hope you have a lot of fun exploring the AI's sense for games! Here and there, I tweaked it to help with visual consistency. I would be deeply grateful if you opted into analytics. $2,300 in API tokens... Cheers!

Revenue N/A
Developer Tools
Homebrew 6.0.0

Today, I’m proud to announce Homebrew 6.0.0. The most significant changes since 5.1.0 are a new tap trust security mechanism, the new faster, smaller, default internal Homebrew JSON API, sandboxing on Linux, better defaults informed by our user survey, many brew bundle improvements, improved performance and initial support for macOS 27 (Golden Gate). Happy to discuss any questions here!

Revenue N/A

Quick Facts

Category: Developer Tools
Audience: BOTH
Founder: brettnbutter
Revenue data: Unknown
