How to Collect the Images and Meta Tags from a Webpage with PHP

Meta Tags and the Facebook Example

You’ve definitely seen the “share a link” screen in Facebook. When you paste a link into the box (fig. 1) and press the “Attach” button, Facebook shows the linked site parsed into a title, a description and possibly a thumbnail (fig. 2). This functionality is well known from Facebook, but other social services offer it too: LinkedIn, Reddit and DZone’s bookmarklet all use it.

fig. 1 - Facebook Attach a Link Prompt Screen

The first thing to notice is that the information Facebook shows is the same as the page’s meta tag information. However, there is a slight difference.

fig. 2 - Facebook Attached Link Screen

For the thumbnail, Facebook prefers the image set in the <meta property="og:image" … /> tag. In the case above this tag is:

<meta property="og:image" content="http://b.vimeocdn.com/ts/572/975/57297584_200.jpg" />

And the image referenced in the content attribute is exactly the same as the one shown by Facebook (fig. 3).

fig. 3 - Vimeo Thumb

Two things are worth noting. First, the real thumbnail is bigger than the one shown in Facebook, so Facebook resizes it. Second, the page contains more meta tags in the og:… format.

Meta Tags and The Open Graph Protocol

Meta tags hold various information about a web page. They are not visible on the page itself, but they describe it. The most common meta tags are title, description and keywords. They contain the title of the page (note that this can differ from the <title> tag), a short description of the page and some keywords describing its content. They are also well known because search engines use them when collecting information about a page, which makes them part of SEO.

However, the default HTML meta tags cannot contain everything. For example, you cannot specify the preferred thumbnail for a webpage. The solution is the Open Graph Protocol. It comes with meta tags that can hold more, and more valuable, information. One such tag is the og:image meta tag. Note that all Open Graph (og) meta tags carry the og: prefix before the property name. Thus og:image is used for images, while og:longitude is used for geo positioning.

That’s really useful, but how can you read them?

PHP, Meta Tags and Regexps

When you try to read information from a webpage’s source, the first option that comes to mind is regular expressions. However, PHP offers some useful built-in functions. One of them is get_meta_tags(). As you may guess, this function reads the meta tags from a given file or URL.

$a = get_meta_tags('http://vimeo.com/10758212');
var_dump($a);
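
To see what get_meta_tags() returns without depending on a live site, the sketch below writes a small, made-up HTML page to a temporary file and reads it back. The page content and the variable names are assumptions for illustration only.

```php
<?php
// Hypothetical sample page, written to a temp file so the example runs offline.
$html = <<<HTML
<html>
<head>
<title>Sample page</title>
<meta name="description" content="A short description of the page">
<meta name="keywords" content="php, meta, tags">
<meta name="author" content="Jane Doe">
</head>
<body></body>
</html>
HTML;

$file = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($file, $html);

// get_meta_tags() accepts a local file name as well as a URL.
$tags = get_meta_tags($file);
unlink($file);

print_r($tags);
```

The result is an associative array keyed by each tag's name attribute, so $tags['description'] holds the description text.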

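However, this function can’t read Open Graph tags, because it only collects meta tags that have a name attribute, while Open Graph tags use the property attribute. So in the end you have to use some regexps.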

$source = file_get_contents('http://vimeo.com/10758212');
preg_match('/<meta property="og:image" content="(.*?)" \/>/', $source, $matches);
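
The pattern above only matches when the attributes appear in exactly that order and with that spacing. As an alternative sketch, not the approach used in this article, PHP's bundled DOMDocument class can parse the markup and collect every og: tag regardless of attribute order. The function name extract_og_tags and the sample HTML are hypothetical.

```php
<?php
// Collect all Open Graph meta tags from an HTML string into an array.
function extract_og_tags(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so keep libxml warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $tags = [];
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $property = $meta->getAttribute('property');
        // Keep only properties that start with the og: prefix.
        if (strpos($property, 'og:') === 0) {
            $tags[$property] = $meta->getAttribute('content');
        }
    }
    return $tags;
}

// Made-up sample; note the swapped attribute order on og:image.
$sample = '<html><head>
<meta property="og:title" content="Sample video">
<meta content="http://example.com/thumb.jpg" property="og:image">
<meta name="description" content="Not an Open Graph tag">
</head><body></body></html>';

$og = extract_og_tags($sample);
print_r($og);
```

For a live page, the string passed in would be the $source fetched with file_get_contents().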

Now you can grab the og:image tag; the URL is captured in $matches[1]. You can go even further and grab every image (<img>) from that page.

preg_match_all('/<img src="(.*?)"/', $source, $m);
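
The pattern above expects src to be the first attribute and to use double quotes. The following is a minimal sketch of a more tolerant variant, assuming you still want to stay with regular expressions; the function name extract_image_urls and the sample markup are made up for illustration.

```php
<?php
// Return the src value of every <img> tag found in an HTML string.
function extract_image_urls(string $html): array
{
    // Allows other attributes before src, single or double quotes,
    // and upper-case tag or attribute names.
    $pattern = '/<img\b[^>]*?\ssrc\s*=\s*(["\'])(.*?)\1/i';
    preg_match_all($pattern, $html, $m);
    return $m[2]; // group 2 holds the URLs; group 1 holds the quote character
}

// Made-up sample markup covering a few common variations.
$sample = <<<'HTML'
<img class="a" src="one.png">
<IMG SRC='two.jpg' alt="x">
<img alt="no source">
<img data-src="lazy.png" src="real.png">
HTML;

$urls = extract_image_urls($sample);
print_r($urls);
```

The returned URLs are exactly as written in the page, so relative paths still have to be resolved against the page URL before downloading.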

6 thoughts on “How to Collect the Images and Meta Tags from a Webpage with PHP”

  1. Pingback: abcphp.com
  2. Hm, shouldn’t this work?

    $source = file_get_contents('http://nrkp3.no/filmpolitiet/2011/07/tidenes-spilloppgjoer-i-miniatyrformat/');
     
    if (preg_match('//', $source)) {
        echo "A match was found.";
    } else {
        echo "A match was not found.";
    }
  3. 'attachment',
    'orderby' => 'menu_order',
    'order' => 'ASC',
    'post_mime_type' => 'image',
    'post_status' => null,
    'numberposts' => null,
    'post_parent' => $post->ID,
    'exclude' => get_post_thumbnail_id()
    );
    $attachments = get_posts($args); ?>

    ID, '_wp_attachment_image_alt', true);
    $image_title = $attachment->post_title;
    $caption = $attachment->post_excerpt;
    $description = $attachment->post_content;
    $src = wp_get_attachment_image_src( $attachment->ID, 'full' );
    list($width, $height, $type, $attr) = getimagesize($src[0]); ?>

    TRY THIS

    <img src="” class=”project-images

  4. I couldn’t get the above to work, but here’s what worked for me:


    include '../simple_html_dom.php'; //PHP Simple DOM parser http://simplehtmldom.sourceforge.net/

    $html = file_get_html('https://onarbor.com/');

    preg_match_all('//', $html, $matches);

    print_r($matches);
