How to pull generated PDF report from website


#1

I run a financial advisor directory service and am trying to automate part of my background-checking process. I would like to pull advisers’ certification information from the SEC. These reports are publicly available, but I don’t want to waste time retrieving them manually.

From inspecting the code, I can see that the PDF is generated automatically. It has a content disposition of inline and a randomly generated file name, and the content type is application/pdf.

Here is a sample url: https://www.adviserinfo.sec.gov/IAPD/Support/ReportViewer.aspx?indvl_pk=5829233

When I try the analyzer tool, it fails to pull anything. I believe this is because the report is built on the fly and takes a few seconds, sometimes up to 20, to build and display in the browser window.

I need ideas on how to code the crawler to avoid an empty pull. I have some experience coding VB and HTML, but need some pointers to get me started.

Thanks,

Dave Matthews


#2

Hello Dave,

The link you provided does not work. That’s because the underlying application is an ASP.NET app, and such apps usually don’t respect any sort of REST or HTTP semantics. This makes them pretty annoying to scrape, but it can be done.

Of our two products, Crawler and Actor, you unfortunately cannot use the Crawler, because it does not support workflows this complex. You can use the Actor to implement your own crawler with our SDK, specifically the PuppeteerCrawler class.

Are you familiar with modern JavaScript, Node.js and Puppeteer? Those are the technologies you’ll need to get this working. The actual process involves visiting the webpage, intercepting the HTTP response that holds the PDF, and saving it to your hard disk or one of our storages.
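To give a rough idea of what that interception could look like, here is a minimal sketch using plain Puppeteer rather than the SDK's PuppeteerCrawler class. It has not been tested against the SEC site. The function names (isPdfResponse, savePdfReport), the 60-second timeout and the example URL are all assumptions made for illustration, and the real site may need extra steps (for example, clicking through from a search page) before the PDF response appears. Depending on the Chrome version, reading the body of an inline PDF may also need adjusting.

```javascript
const fs = require('fs');

// Returns true when the response headers describe a PDF document.
// Puppeteer's response.headers() reports header names in lower case.
function isPdfResponse(headers) {
  const contentType = (headers['content-type'] || '').toLowerCase();
  // Ignore any parameters after the media type, e.g. "; charset=...".
  return contentType.split(';')[0].trim() === 'application/pdf';
}

// Hypothetical helper: opens startUrl, waits for a PDF response and
// writes its body to outputPath. timeoutMs is generous on purpose,
// because the report can take up to 20 seconds to build.
async function savePdfReport(startUrl, outputPath, timeoutMs = 60000) {
  // Loaded lazily so isPdfResponse can be used without Puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Start listening before navigating so the response is not missed.
    const pdfResponsePromise = page.waitForResponse(
      (response) => isPdfResponse(response.headers()),
      { timeout: timeoutMs }
    );
    const [pdfResponse] = await Promise.all([
      pdfResponsePromise,
      page.goto(startUrl),
    ]);
    fs.writeFileSync(outputPath, await pdfResponse.buffer());
  } finally {
    await browser.close();
  }
}

// Example call (placeholder URL, not the real report address):
// savePdfReport('https://example.com/report-page', 'report.pdf');
```

The important design choice is to wait for the response itself instead of waiting for the page to finish loading. That way the slow, on-the-fly report generation does not lead to an empty result.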


#3

Ondra,

Thank you for the reply. Those are not languages that I’m familiar with. I will outsource the job to a coder on Upwork. It is not mandatory that I automate it yet.

Regards,

Dave Matthews


#4

Dave,

Apify actually has an in-house development team that could do just that for you. If price is a deciding factor, there is also a marketplace full of external developers who specialize in web scraping and automation solutions.

You can get in touch directly at https://www.apify.com/request-solution

Regards,

Ondra