I do a lot of scraping with PHP. Typically it's really easy: the HTML is rendered in a consistent format, and templates keep all the pages the same.
There's a current trend towards JavaScript-rendered pages (whether the rendering happens on the back end or in the browser), which means traditional scraping techniques just don't work.
Instead, you need a browser to render the page first, and then do your normal extraction.
Google's Chrome browser, plus Puppeteer (a Node.js library that drives Chrome in headless mode), to the rescue!
To set it all up on CentOS 8, you'll need to run the following:
sh -c 'echo -e "[google-chrome]\nname=google-chrome - 64-bit\nbaseurl=http://dl.google.com/linux/chrome/rpm/stable/x86_64\nenabled=1\ngpgcheck=1\ngpgkey=https://dl-ssl.google.com/linux/linux_signing_key.pub" > /etc/yum.repos.d/google-chrome.repo'
yum update
yum install google-chrome-stable
yum install nodejs
npm install -g puppeteer --unsafe-perm=true
yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
yum install https://rpms.remirepo.net/enterprise/remi-release-8.rpm
yum install yum-utils
dnf module reset php
dnf module install php:remi-7.4
yum -y install php php-pecl-memcache php-pecl-memcached php-pecl-mysql php-fpm php-opcache httpd php-gd rsync
composer require helloiamlukas/chrome-php
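The echo -e one-liner at the start of those steps is dense; an equivalent heredoc form (same repo contents; written to /tmp here so you can inspect the file before copying it into /etc/yum.repos.d/) is easier to read and maintain:

```shell
# Write the Google Chrome repo definition with a heredoc instead of echo -e.
# Staged in /tmp for review; copy to /etc/yum.repos.d/google-chrome.repo after.
cat > /tmp/google-chrome.repo <<'EOF'
[google-chrome]
name=google-chrome - 64-bit
baseurl=http://dl.google.com/linux/chrome/rpm/stable/x86_64
enabled=1
gpgcheck=1
gpgkey=https://dl-ssl.google.com/linux/linux_signing_key.pub
EOF
```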
You can now run the following PHP code to get your fully-rendered page.
<?php
include_once("vendor/autoload.php");
use ChromeHeadless\ChromeHeadless;
$html = ChromeHeadless::url("https://www.google.com/")->getHtml();
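Once getHtml() returns the rendered markup, extraction works the same as with any fetched page. A minimal sketch using PHP's built-in DOM extension (the sample HTML, class name and XPath query are illustrative, standing in for your scraped page):

```php
<?php
// Sample markup standing in for the rendered page returned by getHtml().
$html = '<html><head><title>Example</title></head>'
      . '<body><h1 class="headline">Hello</h1></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // tolerate the invalid markup real pages have
$dom->loadHTML($html);
libxml_clear_errors();

// Query the document with XPath, just as you would on a static page.
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//h1[@class="headline"]') as $node) {
    echo trim($node->textContent), "\n"; // prints "Hello"
}
```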
For some pages, you'll need to let Chrome fully load the page and render all of its components before grabbing the HTML. To enable this, amend the const response = await page.goto line so that Puppeteer waits for the network to go idle:
const response = await page.goto(options.url, {'waitUntil': 'networkidle2'});
With networkidle2, navigation is considered finished once there have been no more than two network connections for at least 500 ms, i.e. the page has settled.
Enjoy :)