The crawl process was very slow, mainly due to rate limits from our API providers and our proxy service. We would have built a cluster, but the expense limited us to hitting a few APIs about once per second. Slowly we completed a full crawl of all 500,000 URLs.

Here are some notes on my experience with URL crawling for data collection:

- Use APIs whenever possible. Aylien was invaluable for tasks where Node libraries proved inconsistent.
- Find a good proxy service that rotates between consecutive calls.
- Build handling logic for websites and content types that are likely to cause errors; Craigslist, PDFs, and Word documents all caused issues during crawling (see the sketch after this list).
- Check the collected data diligently, especially over the first few thousand results, to make sure extraction errors do not corrupt the structure of the collected dataset.
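To make the pacing and error handling concrete, here is a minimal sketch of the kind of crawl loop described above. The one-request-per-second delay matches what our rate limits allowed; the proxy list, the Craigslist/PDF/Word guards, and the extraction hand-off are illustrative assumptions rather than our exact setup.

```typescript
// Minimal throttled crawl loop: rotate proxies between consecutive calls,
// skip content types that broke extraction, and pace requests at ~1/second.
// PROXIES and the Craigslist/PDF/Word checks are illustrative assumptions.

const PROXIES = [
  "http://proxy-1.example.com:8080", // hypothetical rotating proxy endpoints
  "http://proxy-2.example.com:8080",
];

const SKIP_CONTENT_TYPES = ["application/pdf", "application/msword"];

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function crawl(urls: string[]): Promise<void> {
  for (let i = 0; i < urls.length; i++) {
    const url = urls[i];
    const proxy = PROXIES[i % PROXIES.length]; // switch proxies between consecutive calls
    console.log(`Fetching ${url} via ${proxy}`);

    try {
      // Craigslist consistently caused problems, so skip it up front.
      if (new URL(url).hostname.endsWith("craigslist.org")) continue;

      // Route the request through `proxy` with whatever HTTP client you use.
      const res = await fetch(url);
      const contentType = res.headers.get("content-type") ?? "";

      // PDFs and Word documents broke text extraction, so skip them too.
      if (SKIP_CONTENT_TYPES.some((t) => contentType.includes(t))) continue;

      const html = await res.text();
      console.log(`Collected ${html.length} characters from ${url}`);
      // ...hand `html` off to the extraction / API step here...
    } catch (err) {
      console.error(`Failed to crawl ${url}:`, err);
    }

    await sleep(1000); // ~1 request per second to stay under rate limits
  }
}
```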
The results

We reported our results from the ranking predictions in a separate article, but I wanted to review some of the interesting insights in the data we collected.

Most Competitive Niches

For this analysis, we reduced the dataset to the top 20 rankings only and removed the top four percent of observations by referring domains. Trimming that top four percent keeps URLs such as Google, Yelp, and other very large websites from unduly skewing the averages. Since we were focusing on service industry