
My favorite technique is:

wget URL > HTML
tidy HTML > XHTML
xslt [identity transform based content extraction] XHTML > XML
XML > DB

The whole process is glued together with Perl or shell scripts. Depending on how you construct your content extraction, this technique can weather many of the inevitable changes to a site's content styling, and it is easy to adjust when changes do need to be made.
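To make the extraction step concrete, here is a minimal sketch of the idea in Python rather than XSLT. It assumes the page has already been fetched and run through tidy, so the input is well-formed XHTML. It is not an XSLT stylesheet: it imitates the identity-transform approach (copy everything by default, override only the nodes you want gone) with the standard library. The `DROP` rule set, the function names, and the `paragraphs` table are all hypothetical names chosen for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical override rules: elements removed during the copy.
# Everything not listed here is copied through unchanged.
DROP = {XHTML_NS + "script", XHTML_NS + "style", XHTML_NS + "nav"}


def identity_copy(node):
    """Copy a subtree as-is, skipping any element whose tag is in DROP."""
    copy = ET.Element(node.tag, dict(node.attrib))
    copy.text = node.text
    copy.tail = node.tail
    for child in node:
        if child.tag in DROP:
            # Keep the text that followed the dropped element so the
            # surrounding sentence stays intact.
            kept_tail = child.tail or ""
            if len(copy):
                copy[-1].tail = (copy[-1].tail or "") + kept_tail
            else:
                copy.text = (copy.text or "") + kept_tail
            continue
        copy.append(identity_copy(child))
    return copy


def load_paragraphs(xhtml, conn):
    """Clean the XHTML, then store each remaining <p> as a row."""
    root = identity_copy(ET.fromstring(xhtml))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS paragraphs (pos INTEGER, body TEXT)"
    )
    rows = [
        (i, "".join(p.itertext()).strip())
        for i, p in enumerate(root.iter(XHTML_NS + "p"))
    ]
    conn.executemany("INSERT INTO paragraphs VALUES (?, ?)", rows)
    return len(rows)
```

Because the default is to copy, a site redesign that adds or reorders markup usually passes straight through, and adapting to a change that does matter means editing the small `DROP` set rather than rewriting the extractor.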


