Pseudo URL RegEx not containing substring

#1

I pass this PseudoURL to my CheerioCrawler:

'http[s?]://[[-\w.]+]nachrichten.at/nachrichten/[^(?!.*,E).*;art.*$]'

I would like to match URLs that contain “;art” but do not end with “,E”. This does not seem to work, and I am not sure whether the error is in my regex or in how I define the PseudoURL.

Some examples:

I would like to get these:

But NOT these:

#2

Hello, since you’re using CheerioCrawler, you can use a plain RegExp instance in place of a Pseudo URL string. That makes it a little bit more readable without the extra brackets.

I’ve come up with the following regular expression:

/https?:\/\/[-\w.]+nachrichten\.at\/(nachrichten|oberoesterreich)\/.((?!,E$).)*;art((?!,E$).)*$/
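To see how the expression behaves, here is a minimal sketch that runs it against a few URLs. The sample URLs and the helper name `isArticleUrl` are made up for illustration (they are not taken from the original post), so treat the paths and article IDs as assumptions.

```javascript
// The regular expression from the post above, unchanged.
const ARTICLE_URL = /https?:\/\/[-\w.]+nachrichten\.at\/(nachrichten|oberoesterreich)\/.((?!,E$).)*;art((?!,E$).)*$/;

// Hypothetical sample URLs, invented only to exercise the pattern.
const samples = [
    // contains ";art" and does not end with ",E"
    'https://www.nachrichten.at/nachrichten/politik/example-story;art385,1234567',
    'https://www.nachrichten.at/oberoesterreich/linz/example-story;art4,7654321',
    // contains ";art" but ends with ",E"
    'https://www.nachrichten.at/nachrichten/politik/example-story;art385,1234567,E',
    // no ";art" at all
    'https://www.nachrichten.at/nachrichten/politik/',
];

// Hypothetical helper: true when the URL should be enqueued.
function isArticleUrl(url) {
    // No "g" flag on the regex, so test() keeps no state between calls.
    return ARTICLE_URL.test(url);
}

for (const url of samples) {
    console.log(isArticleUrl(url), url);
}
```

The `((?!,E$).)*` part consumes one character at a time, but only while the rest of the string is not exactly `,E`. That is what keeps the match from ever reaching the end of a URL that finishes with `,E`.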

It will match your URLs as requested. It’s a little ugly because JavaScript does not officially support lookbehind assertions; with a lookbehind it would be much simpler. Lookbehinds are available in Node.js since version 10, but we still support version 8, so we can’t really use them yet. But soon we will.
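For comparison, here is a rough sketch of what the lookbehind variant could look like on a Node.js version that supports it. This is my reading of "much simpler", not a pattern from the original post; the constant name and the sample URLs are hypothetical.

```javascript
// Assumed lookbehind variant: match anything up to the end of the URL,
// then require that the end is NOT directly preceded by ",E".
const ARTICLE_URL_LOOKBEHIND = /https?:\/\/[-\w.]+nachrichten\.at\/(nachrichten|oberoesterreich)\/.+;art.*(?<!,E)$/;

// Hypothetical sample URLs, invented only to exercise the pattern.
const wanted = 'https://www.nachrichten.at/nachrichten/politik/example-story;art385,1234567';
const unwanted = 'https://www.nachrichten.at/nachrichten/politik/example-story;art385,1234567,E';

console.log(ARTICLE_URL_LOOKBEHIND.test(wanted), wanted);
console.log(ARTICLE_URL_LOOKBEHIND.test(unwanted), unwanted);
```

The single `(?<!,E)$` check replaces both `((?!,E$).)*` groups, which is where the readability gain comes from.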