Web Crawling

MavicMaverick
Posts: 4
Joined: Tue Apr 14, 2020 10:47 am

Web Crawling

Post by MavicMaverick »

Dear all,

I am currently learning web scraping and wanted to practise by scraping the question information from ProjectEuler as part of the CI/CD solution I described in another thread (viewtopic.php?f=5&t=7070). I want to emphasise that I did not SHARE the questions or USE them for my OWN BENEFIT. However, after crawling the puzzle index, I found that I had lost access to the website, apparently because my IP had been blacklisted for making too many requests.

I sincerely apologise to the administrators for any inconvenience my antics caused, and would like to ask whether it is okay for me to use web scraping as part of my answer-checking program. If it is, could you please tell me the maximum number of concurrent requests I may make (I can adjust this in my web crawler), the delay I should leave between requests, and whether there is a robots.txt file with crawling restrictions that my bot should obey, so that I am not blacklisted again for what looks like a DoS or DDoS attack? Also, every request my crawler makes carries a header containing my details, should you need to contact me.
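
To make the settings in question concrete (robots.txt rules, a delay between requests, an identifying header), here is a minimal sketch using only the Python standard library. It is an illustration under assumptions, not the crawler actually used and not limits confirmed by the site: the class name PoliteFetcher, the user agent string, the contact address, and the 2-second default delay are all hypothetical placeholders.

```python
import time
import urllib.request
import urllib.robotparser

# Hypothetical identity; replace with your own crawler name and contact details.
USER_AGENT = "example-crawler/0.1 (+mailto:you@example.com)"


class PoliteFetcher:
    """Fetch pages one at a time, honouring robots.txt and a minimum delay."""

    def __init__(self, robots_lines, min_delay=2.0, user_agent=USER_AGENT,
                 clock=time.monotonic, sleep=time.sleep):
        self.user_agent = user_agent
        self.robots = urllib.robotparser.RobotFileParser()
        # robots_lines is the robots.txt content, already split into lines.
        self.robots.parse(robots_lines)
        # Use the larger of our own delay and any Crawl-delay the site declares.
        declared = self.robots.crawl_delay(user_agent)
        self.delay = max(min_delay, declared or 0)
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def allowed(self, url):
        """Return True if robots.txt permits this user agent to fetch url."""
        return self.robots.can_fetch(self.user_agent, url)

    def wait_turn(self):
        """Sleep until self.delay seconds have passed since the last request."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.delay - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request = self._clock()

    def fetch(self, url, timeout=30):
        """Fetch url sequentially, sending the identifying User-Agent header."""
        if not self.allowed(url):
            raise PermissionError(f"robots.txt disallows {url}")
        self.wait_turn()
        request = urllib.request.Request(
            url, headers={"User-Agent": self.user_agent})
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
```

Because fetch is called sequentially, this sketch never has more than one request in flight, which is the most conservative answer to the concurrency question until an administrator states an actual limit.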

I look forward to hearing from you,
MavicMaverick

hk
Administrator
Posts: 10817
Joined: Sun Mar 26, 2006 10:34 am
Location: Haren, Netherlands

Re: Web Crawling

Post by hk »

Incidental scraping of public information should pose no problem, I think.
However, large-scale scraping eats our scarce resources and is discouraged.
If you want to do that for some reason, you can access the public data using the commands described in viewtopic.php?p=32233#p32233

MavicMaverick
Posts: 4
Joined: Tue Apr 14, 2020 10:47 am

Re: Web Crawling

Post by MavicMaverick »

Thank you. I shall proceed that way.
