Which countries were in the news over the years? Analysing a data dump from social media.

Ubaada | 05-10-2024

The bar chart shows, for each country, the sum of upvotes across all posts that mention the country or words related to it (its capital city, national language, etc.). This obviously doesn’t detect every instance reliably, but it still managed to hit 3M out of the 4.2M total posts.

Alternatively, I could have counted the number of posts rather than the total upvotes. However, that is prone to botting. There is a very high number of posts with only 1 or 2 upvotes: the median post has just 1 upvote, the average is 190, and 95% of posts got fewer than 170 upvotes, so the distribution is heavily skewed. A raw post count doesn’t tell us which news topics were actually being discussed. That said, the final bar chart doesn’t look much different either way.

Posts with upvotes in the thousands are very rare. The highest one got 240k.

The Data

The data is a dump of all the posts from the popular subreddit 'worldnews' which, as the name implies, contains news posts shared by users with the rest of the group. It holds over 4 million news posts dating from 2008 to 2024, and the file is about 8 GB in size.

Loading:

The file is in JSONL format, i.e. a list of JSON objects (one per post) separated by newlines (\n). The data is read in using the standard Python json library. Most data fields are thrown out; only a few are kept in memory.

import json

input_file = "r_worldnews_posts-all.jsonl"

def process_file(input_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        records = []
        for i, line in enumerate(f):
            record = json.loads(line)
            # Keep only the fields we need; guard against missing/invalid values
            records.append({
                'created_utc': record['created_utc'],
                'num_comments': record['num_comments'] if isinstance(record['num_comments'], (int, float)) else 0,
                'score': record['score'] if isinstance(record['score'], (int, float)) else 0,
                'title': record['title'].replace('\n', ' ').replace('\r', '')
            })
            if i % 1000 == 0:
                print(f"Processed {i} records", end='\r')
        return records


records = process_file(input_file)
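
As an aside, the distribution figures quoted in the introduction (median, mean, 95th percentile of upvotes) can be computed directly from this list. A minimal sketch using Python's standard statistics module (exact numbers depend on the dump):

import statistics

# Quick look at the upvote distribution mentioned in the intro
# (the figures quoted earlier were median ~1, mean ~190, 95th percentile <170)
scores = [r['score'] for r in records]
print("median:", statistics.median(scores))
print("mean:", statistics.mean(scores))
print("95th percentile:", statistics.quantiles(scores, n=20)[-1])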

Filtering:

To associate a post with different countries, we take a list of country names and search for them across all posts. The search is a simple substring match. Since a country can be mentioned under different variants of its name, the list also contains aliases (major cities, demonyms, well-known figures); a hit on an alias counts towards its country. The names file looks like this:

...
New Zealand,NZ,Kiwi,New Zealander,Wellington,Auckland,Christchurch,Queenstown,Jacinda
Oman,Omani,Muscat,Sultan Qaboos
Pakistan,Pakistani,Islamabad,Karachi,Lahore,Peshawar ....
...

The names are loaded into an array and matched against the post titles:

# Load the list of countries with their alternative names
country_file = 'common-countries.txt'
countries = []
with open(country_file, 'r', encoding='utf-8') as f:
    for line in f:
        country_alts = line.strip().split(',')
        countries.append({
            'name': country_alts[0],
            'alts': country_alts[1:]
        })

# Search the names across all posts
records_enriched = []
fcount = 0
for record in records:
    # Check if the country name or one of its aliases appears in the title.
    # If so, add the record to the enriched list.
    # A post can match (and be counted for) more than one country.
    for country in countries:
        if country['name'] in record['title'] or any(alt in record['title'] for alt in country['alts']):
            records_enriched.append({
                'country': country['name'],
                'created_utc': record['created_utc'],
                'num_comments': record['num_comments'],
                'score': record['score'],
                'title': record['title']
            })
    fcount += 1
    if fcount % 1000 == 0:
        print(f"Processed {fcount} records", end='\r')

The alias list is still missing a lot of possible hits, since there can be many more variants, but it is still able to catch about 3 million posts out of the 4M total. There is an alternative way to search much faster using sets: tokenize the titles into words and match the words against the list of country names. That's about 40 times faster than the method above, but it cannot handle multi-word names, which can't be split up, so it only matches about 2M of the 4M posts. A sketch of that variant is shown below.
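
For reference, here is a minimal sketch of that set-based variant. The logic is illustrative rather than the exact code used: single-word names and aliases are matched by intersecting the set of title words with a lookup table, which is exactly why multi-word names like "New Zealand" get missed.

# Build a lookup from single-word alias -> canonical country name.
# Multi-word names (e.g. "New Zealand") cannot be matched this way.
alias_to_country = {}
for country in countries:
    for name in [country['name']] + country['alts']:
        if ' ' not in name:
            alias_to_country[name] = country['name']
alias_words = set(alias_to_country)

records_enriched_fast = []
for record in records:
    title_words = set(record['title'].replace(',', ' ').split())
    # Set intersection is much cheaper than a substring search per country
    for hit in title_words & alias_words:
        records_enriched_fast.append({
            'country': alias_to_country[hit],
            'created_utc': record['created_utc'],
            'num_comments': record['num_comments'],
            'score': record['score'],
            'title': record['title']
        })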

Exporting For Web

The plots above are hand-written HTML/CSS/JavaScript. A static bar chart could easily have been produced in Python, but I wanted a racing/animated bar chart, and doing it in the browser also allows interactive demos like the line graph above. To do so, the data needs to be repackaged in a way that is easily consumable by a frontend browser. To reduce the bulk, I limited the time resolution to one week.

First we process the data into a table format. The table is then exported to disk and loaded into the HTML page using JS. The script then parses the data and draws the plots using SVGs.

The filtered list is converted into a dataframe, which lets us group the data by week and create columns that (1) sum the upvotes for the week and store (2) the top post's title along with (3) its upvotes.

import pandas as pd

df = pd.DataFrame(records_enriched)
# created_utc may be a raw Unix timestamp; if so, pass unit='s' to to_datetime
df['created_utc'] = pd.to_datetime(df['created_utc'])
df['week'] = df['created_utc'].dt.to_period('W')

# Group by country and week, then for each group compute:
# - the sum of 'score' for that week and country,
# - the 'title' with the highest score in that week for that country,
# - the score of that top title.
result = df.groupby(['country', 'week']).apply(lambda x: pd.Series({
    'sum_score': x['score'].sum(),
    'title_with_max_score': x.loc[x['score'].idxmax(), 'title'],
    'score_of_title_with_max_score': x['score'].max()
})).reset_index()

# Merge 'sum_score', 'title_with_max_score', 'score_of_title_with_max_score'
# into one column with ';' as the separator
result['summary'] = result['sum_score'].astype(str) + ';' + result['title_with_max_score'] + ';' + result['score_of_title_with_max_score'].astype(str)
result.drop(['sum_score', 'title_with_max_score', 'score_of_title_with_max_score'], axis=1, inplace=True)
result = result.pivot(index='country', columns='week', values='summary')

# Note: newer pandas versions use 'lineterminator' instead of 'line_terminator'
result.to_csv('parsed4web.dat', sep='~', line_terminator='\n')

The table is now in the following format:

| country | week1       | week2       | .... |
| USA     | summarytext | summarytext | .... |

A 2D table like this is easy to parse and read using basic string operations. However, since a table only gives us one data point for each row (country) / column (week) pair, we concatenate multiple data points and store them in the same 'cell'. So 'summarytext' is itself a concatenation of three values:

summarytext = total points;top title;top title points


Visualisation:

The table is now separated by delimiters: rows by \n, columns by ~, and the fields within summarytext by ;.

var weekheaders;
let country_names;

let country_week_total_points;
let country_week_titles;
let country_week_title_points;

const svgNamespace = "http://www.w3.org/2000/svg"
let country_colors = [];

function parseCSV(csv) {
    var lines = csv.split('\n');
    var result = [];
    var headers = lines[0].split('~');

    /* format
    * | country | week1       | week2       | .... |
    * | USA     | summarytext | summarytext | .... |
    * 
    * summarytext = total points;top title;top title points
    * Delimiters: \n   ~   ;
    */

    // 1d arrays
    weekheaders = headers.slice(1);
    country_names = [];

    // unpacked summary text into 3 x 2D arrays
    let num_countries = lines.length - 1;
    let num_weeks = headers.length - 1;
    country_week_total_points = new Array(num_countries);
    country_week_titles = new Array(num_countries);
    country_week_title_points = new Array(num_countries);

    // Iterate over each country
    for (var c = 1; c < lines.length; c++) {
        if (lines[c].length == 0) {
            continue;
        }
        var currentline = lines[c].split('~');

        // country name
        country_names.push(currentline[0]);

        // Iterate over each week
        country_week_total_points[c - 1] = new Array(num_weeks);
        country_week_titles[c - 1] = new Array(num_weeks);
        country_week_title_points[c - 1] = new Array(num_weeks);
        for (var w = 1; w < currentline.length; w++) {
            // default values
            let total_points = 0;
            let titles = '';
            let title_points = 0;

            // Unpack summary text (if valid)
            let summary_stitched = currentline[w];
            if (summary_stitched) {
                let summary = summary_stitched.split(';');
                if (summary.length == 3) {
                    total_points = parseInt(summary[0].replace("\"", ""));
                    titles = summary[1];
                    title_points = parseInt(summary[2].replace("\"", ""));
                }
            }
            country_week_total_points[c - 1][w - 1] = total_points;
            country_week_titles[c - 1][w - 1] = titles;
            country_week_title_points[c - 1][w - 1] = title_points;
        }
    }
}

This produces three 2D arrays which can then be used to draw the SVG plots on the page.

Resources

- Arctic Shift repo on GitHub

- Countries list with aliases (common-countries.txt)

- Visualisation using JS (graph.js)