Crawl results rendered by JavaScript in an iframe


#1

Hi,

I am trying to scrape this site to retrieve the ‘Prayer Timings’ section. The main page loads an iframe, and JavaScript inside the iframe then retrieves the timings. Is it possible for the crawler to access the iframe from the context of the main page? I have tried waiting for the target element to appear, without any success.

Thanks.


#2

Here is what I’ve tried so far. However, ‘theHtml’ only ever contains ‘<head></head><body></body>’.

exports.apifySettings = {
  startUrls: [ { 'value': 'http://www.dicenter.org/' } ],
  disableWebSecurity: true, // for diff origin iframe access
  pageFunction: function pageFunction (context) {
    const date = new Date()
    const startedAt = Date.now()
    const results = [
      {
        crawlTime: date
      }
    ]

    const extractData = function () {
      // timeout after 15 seconds
      if ((Date.now() - startedAt) > 15000) {
        // // refresh page screenshot and HTML for debugging
        // context.saveSnapshot()
        // collect debugging info; guard each step so a missing iframe cannot throw here
        const iframe = document.querySelector('iframe#comp-ikj21sz5iframe')
        const iframeDoc = iframe && iframe.contentWindow ? iframe.contentWindow.document : null
        const phase1 = iframe == null
        const phase2 = iframeDoc == null
        const theHtml = iframeDoc ? iframeDoc.documentElement.innerHTML : null
        context.finish({
          results: [
            {
              error: 'timed out',
              phase1: phase1,
              phase2: phase2,
              theHtml: theHtml
            }
          ]
        })
        return
      }

      // if the iframe element hasn't been added to the page yet, wait a little more
      if (document.querySelector('iframe#comp-ikj21sz5iframe') == null) {
        setTimeout(extractData, 500)
        return
      }

      // if the timings table inside the iframe hasn't been rendered yet, wait a little more
      if (document.querySelector('iframe#comp-ikj21sz5iframe').contentWindow.document.querySelector('#theTable > tbody > tr:nth-child(1) > td:nth-child(2)') == null) {
        setTimeout(extractData, 500)
        return
      }

      // refresh page screenshot and HTML for debugging
      context.saveSnapshot()

      // save a result
      context.finish({ results: results })
    }

    context.willFinishLater()
    extractData()
  }.toString()
}

#3

Hi @Sahir_Hoda,

You can enqueue the iframe's URL as the next page and scrape its content in the next pageFunction call.
Here is an example of enqueueing the iframe content:

const url = document.querySelector('iframe#comp-ikj21sz5iframe').src
context.enqueuePage({ url, userData: { label: 'iframeContent' } })

Here is an example of how you can check the label and scrape the content:

if (context.request.userData && context.request.userData.label === 'iframeContent') {
  // scrape the iframe content here and return the data
}
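
Putting the two snippets together, a pageFunction could branch on the label roughly as sketched below. This is only a sketch: `handlePage` and its `doc` parameter are hypothetical names introduced so the logic can be exercised outside the crawler, the cell selector is the one from post #2, and it assumes the iframe's `src` resolves to a non-empty URL.

```javascript
// Selectors taken from the posts above.
const IFRAME_SELECTOR = 'iframe#comp-ikj21sz5iframe'
const CELL_SELECTOR = '#theTable > tbody > tr:nth-child(1) > td:nth-child(2)'

// Hypothetical helper: `context` is the crawler context, `doc` is the page's document.
function handlePage (context, doc) {
  const userData = context.request.userData

  if (userData && userData.label === 'iframeContent') {
    // Second visit: the crawler has opened the iframe URL as a top-level page,
    // so the table is part of this document and no cross-frame access is needed.
    const cell = doc.querySelector(CELL_SELECTOR)
    return { firstTiming: cell ? cell.textContent.trim() : null }
  }

  // First visit (main page): enqueue the iframe URL, if there is one.
  const iframe = doc.querySelector(IFRAME_SELECTOR)
  const url = iframe ? iframe.src : ''
  if (url) {
    context.enqueuePage({ url: url, userData: { label: 'iframeContent' } })
  }
  return null // nothing to report from the main page itself
}
```

Since the settings in post #2 serialize pageFunction with `.toString()`, the body of `handlePage` would have to live inside pageFunction itself, called as `handlePage(context, document)`.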

Let me know if it helps.


#4

It looks like I cannot access the src attribute of the iframe. document.querySelector('#comp-ikj21sz5iframe').src is an empty string, even if I wait 15 seconds.

I am using the following to wait for it, but src never gets a value:

        if (!(document.querySelector('#comp-ikj21sz5iframe').src)) {
          console.log('waiting for src element')
          Object.keys(document.querySelector('#comp-ikj21sz5iframe')).forEach(function (k) {
            console.log('[' + k + ']: ' + document.querySelector('#comp-ikj21sz5iframe')[k])
          })
          setTimeout(extractData, 500)
          return
        }

And here is the log: https://gist.github.com/theimpostor/918d93603f3ff6d27776d572360a8872

As you can see, none of the logged properties contains the src URL of the iframe.
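
A complementary check, sketched here as a debugging aid rather than a known fix: the `src` property only reflects the element's `src` attribute, so listing the markup attributes directly would show whether the URL is stored under some other attribute name or is absent altogether. `dumpAttributes` is a hypothetical helper name.

```javascript
// Hypothetical helper: copies an element's markup attributes into a plain object.
function dumpAttributes (el) {
  const out = {}
  // el.attributes is an array-like collection of { name, value } entries
  for (var i = 0; i < el.attributes.length; i++) {
    const attr = el.attributes[i]
    out[attr.name] = attr.value
  }
  return out
}

// Possible use inside the page:
// console.log(JSON.stringify(dumpAttributes(document.querySelector('#comp-ikj21sz5iframe'))))
```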

Is there a different setting I should try? Up until now I have been using the Apify crawler, which I believe uses PhantomJS. Would I get different or better results using Puppeteer / Apify SDK?
Thanks.
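
For comparison, here is a minimal sketch of how the same lookup might be written with Puppeteer, which exposes every frame of a page through `page.frames()`. It does not establish that Puppeteer renders this site any differently. `findFrameWithSelector` and `readFirstTiming` are hypothetical helper names, `page` is assumed to be a Puppeteer Page that has already navigated to the site, and the selector is the one from post #2.

```javascript
// Selector of the first timing cell, taken from post #2.
const TIMING_CELL_SELECTOR = '#theTable > tbody > tr:nth-child(1) > td:nth-child(2)'

// Hypothetical helper: returns the first frame (main frame or iframe)
// that currently contains an element matching `selector`, or null.
async function findFrameWithSelector (page, selector) {
  for (const frame of page.frames()) {
    const handle = await frame.$(selector)
    if (handle) return frame
  }
  return null
}

// Hypothetical helper: reads the trimmed text of the first timing cell,
// or null if no frame contains it yet.
async function readFirstTiming (page) {
  const frame = await findFrameWithSelector(page, TIMING_CELL_SELECTOR)
  if (!frame) return null
  return frame.$eval(TIMING_CELL_SELECTOR, function (el) {
    return el.textContent.trim()
  })
}
```

Because the table is filled in by JavaScript, `readFirstTiming` may need to be retried, much like the polling loop in post #2, until it returns a non-null value.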