Ask HN: Open source focused crawler?
6 points by cookerware on Feb 9, 2014 | 3 comments
Is there an open source crawler/library that will recursively follow only links under a certain XPath and ignore the rest?

I don't want to do an exhaustive crawl of every single link; I want something that will only follow links under a main content area.



I highly recommend Scrapy (http://www.scrapy.org).

From their site:

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.


Check this out: http://commoncrawl.org/

It's not exactly what you are looking for, but it might help you.


Have you tried BeautifulSoup?
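
As a rough sketch of what that could look like: BeautifulSoup parses pages but does not crawl or support XPath, so this example uses a CSS selector in place of the XPath and only collects the links from one page. Fetching pages and the recursion would have to be added separately. The `#content` selector and the `content_links` function name are assumptions for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def content_links(html, base_url, selector="#content a[href]"):
    """Return absolute URLs of links inside the selected content area."""
    soup = BeautifulSoup(html, "html.parser")
    # select() takes a CSS selector; links outside the content area are skipped.
    return [urljoin(base_url, a["href"]) for a in soup.select(selector)]
```

A crawler built on this would fetch each returned URL, call `content_links` again on the response, and keep a set of visited URLs to avoid loops.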



