The Pandora Papers’ 11.9 million records arrived from 14 different offshore services firms in a jumble of files and formats – even ink-on-paper – presenting a massive data-management challenge.

The 2.94-terabyte data trove exposes the offshore secrets of wealthy elites from more than 200 countries and territories. These are people who use tax and secrecy havens to buy property and hide assets; many avoid taxes and worse. They include more than 330 politicians and 130 Forbes billionaires, as well as celebrities, fraudsters, drug dealers, royal family members and leaders of religious groups around the world.

The International Consortium of Investigative Journalists spent more than a year structuring, researching and analyzing the more than 11.9 million records in the Pandora Papers leak. The task involved three main elements: journalists, technology and time.

What is the Pandora Papers?

The Pandora Papers investigation is the world’s largest-ever journalistic collaboration, involving more than 600 journalists from 150 media outlets in 117 countries.

The investigation is based on a leak of confidential records of 14 offshore service providers that give professional services to wealthy individuals and corporations seeking to incorporate shell companies, trusts, foundations and other entities in low- or no-tax jurisdictions. The entities enable owners to conceal their identities from the public and sometimes from regulators. Often, the providers help them open bank accounts in countries with light financial regulation.

The 2.94 terabytes of data, leaked to ICIJ and shared with media partners around the world, arrived in various formats: as documents, images, emails, spreadsheets, and more.

The records include an unprecedented amount of information on so-called beneficial owners of entities registered in the British Virgin Islands, Seychelles, Hong Kong, Belize, Panama, South Dakota and other secrecy jurisdictions. They also contain information on the shareholders, directors and officers. In addition to the rich, the famous and the infamous, those exposed by the leak include people who don’t represent a public interest and who don’t appear in our reporting, such as small business owners, doctors and other, usually affluent, individuals away from the public spotlight.

While some of the files date to the 1970s, most of those reviewed by ICIJ were created between 1996 and 2020. They cover a wide range of matters: the creation of shell companies, foundations and trusts; the use of such entities to purchase real estate, yachts, jets and life insurance; their use to make investments and to move money between bank accounts; estate planning and other inheritance issues; and the avoidance of taxes through complex financial schemes. Some documents are tied to financial crimes, including money laundering.

What’s in the Pandora Papers?

The more than 330 politicians exposed by the leak were from more than 90 countries and territories. They used entities in secrecy jurisdictions to buy real estate, hold money in trust, and own other companies and assets, sometimes anonymously.

The Pandora Papers investigation also reveals how banks and law firms work closely with offshore service providers to design complex corporate structures. The files show that providers don’t always know their customers, despite their legal obligation to take care not to do business with people who engage in questionable dealings.

The investigation also reports on how U.S. trust providers have taken advantage of some states’ laws that promote secrecy and help wealthy overseas clients hide wealth to avoid taxes in their home countries.

What form did the data come in?

The 11.9 million-plus records were largely unstructured. More than half of the files (6.4 million) were text documents, including more than 4 million PDFs, some of which ran to more than 10,000 pages. The documents included passports, bank statements, tax declarations, company incorporation records, real estate contracts and due diligence questionnaires. There were also more than 4.1 million images and emails in the leak.

Spreadsheets made up 4% of the documents, or more than 467,000. The records also included slide shows and audio and video files.



What’s different about this leak from others we’ve heard about?

The Pandora Papers information – the 2.94 terabytes in more than 11.9 million records – comes from 14 providers that offer services in at least 38 jurisdictions. The 2016 Panama Papers investigation was based on 2.6 terabytes of data in 11.5 million documents from a single provider, the now-defunct Mossack Fonseca law firm. The 2017 Paradise Papers investigation was based on a leak of 1.4 terabytes in more than 13.4 million files from one offshore law firm, Appleby, as well as Asiaciti Trust, a Singapore-based provider, and government corporate registries in 19 secrecy jurisdictions.

The Pandora Papers presented a new challenge because the 14 providers had different ways of presenting and organizing information. Some organized documents by client, some by various offices, and others had no apparent system at all. A single document sometimes contained years’ worth of emails and attachments. Some providers digitized their records and structured them in spreadsheets; others kept paper files that were scanned. Some PDFs contained spreadsheet tables that had to be rebuilt as working spreadsheets. The documents arrived in English, Spanish, Russian, French, Arabic, Korean and other languages, requiring extensive coordination among ICIJ partners.

The Pandora Papers gathered information on more than 27,000 companies and 29,000 so-called ultimate beneficial owners from 11 of the providers, or more than twice the number of beneficial owners identified in the Panama Papers.

The Pandora Papers connected offshore activity to more than twice as many politicians and public officials as did the Panama Papers. And the Pandora Papers’ more than 330 politicians and public officials, from more than 90 countries and territories, included 35 current and former country leaders.

The new leak also includes information on jurisdictions not explored in previous ICIJ projects or for which there was little data, such as Belize, Cyprus and South Dakota.

The legal entities in the files of six providers – the companies, foundations and trusts – were all registered between 1971 and 2018. The records show providers and clients shifting their business from one jurisdiction to another after investigations and resulting rule changes.

How did you explore the files?

Only 4% of the files were structured, with data organized in tables (spreadsheets, CSV files and a few “dbf files”).

To explore and analyze the information in the Pandora Papers, ICIJ identified files that contained beneficial ownership information by company and jurisdiction and structured it accordingly. Each provider’s data required a different process.

In cases where information came in spreadsheet form, ICIJ removed duplicates and combined it into a master spreadsheet. For PDF or document files, ICIJ used programming languages such as Python to automate data extraction and structuring as much as possible.
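The deduplicate-and-combine step for spreadsheet data can be illustrated with a minimal Python sketch. It is not ICIJ's actual pipeline: the column names (`owner`, `company`, `jurisdiction`), the provider labels and the matching rule (case- and whitespace-insensitive comparison of those columns) are all assumptions made for illustration, and every name in the sample data is fictional.

```python
import csv
import io


def normalize(value):
    """Collapse whitespace and ignore case so near-identical cells match."""
    return " ".join((value or "").split()).casefold()


def combine_spreadsheets(sources, key_fields):
    """Merge rows from several CSV sources into one list, skipping duplicates.

    sources: iterable of (provider_label, file-like object) pairs.
    key_fields: columns that together identify a record (an assumption).
    """
    seen = set()
    master = []
    for provider, handle in sources:
        for row in csv.DictReader(handle):
            key = tuple(normalize(row.get(field)) for field in key_fields)
            if key in seen:
                continue  # same record already kept from an earlier row
            seen.add(key)
            master.append({**row, "provider": provider})
    return master


# Fictional sample data standing in for two providers' spreadsheets.
provider_a = io.StringIO(
    "owner,company,jurisdiction\n"
    "Jane Doe,Acme Ltd,BVI\n"
    "jane doe,ACME LTD,bvi\n"
)
provider_b = io.StringIO(
    "owner,company,jurisdiction\n"
    "John Roe,Beta Inc,Belize\n"
)

master = combine_spreadsheets(
    [("provider_a", provider_a), ("provider_b", provider_b)],
    key_fields=["owner", "company", "jurisdiction"],
)
```

Keeping a `provider` column on every surviving row preserves the source of each record, which matters when the same company appears in more than one provider's files.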

In more complex cases, ICIJ used machine learning and other tools, including the Fonduer and scikit-learn libraries, to identify and separate specific forms from longer documents.
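One way to picture form identification with scikit-learn is a page-level text classifier that flags the pages of a long document that look like a particular form. The sketch below is only an assumption about how such a step could work, not a description of ICIJ's models: the training snippets are invented, the choice of TF-IDF features with logistic regression is illustrative, and a real system would need far more labeled pages (and OCR for scanned files).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented page texts: label 1 = beneficial-ownership form, 0 = anything else.
train_pages = [
    "declaration of beneficial owner name date of birth passport number signature",
    "beneficial owner form full name nationality residential address signature",
    "register of beneficial owner name address passport number",
    "dear sir please find attached the invoice for annual fees regards",
    "minutes of the meeting of the board resolved to approve accounts",
    "email thread regarding invoice payment and annual fees schedule",
]
train_labels = [1, 1, 1, 0, 0, 0]

# TF-IDF turns each page into word weights; logistic regression learns
# which words separate form pages from other pages.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_pages, train_labels)

# Pages of one long (fictional) document, already converted to text.
document_pages = [
    "dear sir please find attached the minutes of the meeting",
    "declaration of beneficial owner name passport number signature",
    "regards and thanks for the payment of annual fees",
]

# Indexes of pages the model flags as forms, ready to be split out.
predictions = model.predict(document_pages)
form_page_indexes = [i for i, label in enumerate(predictions) if label == 1]
```

Classifying page by page keeps the approach independent of document length, so the same model can be run over a 10-page file or a 10,000-page one.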

Some provider forms were handwritten, requiring ICIJ to extract information manually.

Once information was extracted and structured, ICIJ generated lists that linked beneficial owners to the companies they owned in specific jurisdictions. In some cases, information about where or when a company was registered wasn’t available. In others, information was missing about when a person or an entity had become the owner of the company, among other details.
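A structured list of this kind has to tolerate gaps, since registration details or ownership dates were not always available. The following Python sketch shows one possible record shape in which missing values are stored explicitly as `None`. The field names and the sample records are hypothetical and are not taken from ICIJ's data model.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OwnershipRecord:
    owner: str
    company: str
    jurisdiction: Optional[str] = None       # None when not in the source files
    registered_on: Optional[str] = None      # company registration date, if known
    owner_since: Optional[str] = None        # date ownership began, if known


def companies_by_owner(records):
    """Group ownership records under each beneficial owner's name."""
    index = defaultdict(list)
    for record in records:
        index[record.owner].append(record)
    return dict(index)


def missing_fields(record):
    """List the optional fields that the source files did not supply."""
    optional = ("jurisdiction", "registered_on", "owner_since")
    return [name for name in optional if getattr(record, name) is None]


# Fictional records for illustration only.
records = [
    OwnershipRecord("Jane Doe", "Acme Ltd", "British Virgin Islands", "2004-03-01", "2005-01-15"),
    OwnershipRecord("Jane Doe", "Beta Inc", "Belize"),
    OwnershipRecord("John Roe", "Gamma Trust"),
]

by_owner = companies_by_owner(records)
gaps = {record.company: missing_fields(record) for record in records}
```

Recording a gap as `None` rather than guessing a value makes it easy to report which entries still need manual review.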

After structuring the data, ICIJ used graph platforms (Neo4j and Linkurious) to generate visualizations and make them searchable. This allowed reporters to explore connections between people and companies across providers.
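The idea behind exploring connections across providers can be shown with a small in-memory graph. The sketch below uses plain Python as a stand-in and does not use the Neo4j or Linkurious APIs. People and companies are nodes, each ownership link is an edge tagged with the provider it came from, and a breadth-first search finds the shortest chain linking two people. All names and provider labels are fictional.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected graph from (person, company, provider) triples.

    Nodes are prefixed so a person and a company with the same name stay distinct.
    Each edge remembers which providers' files contained the link.
    """
    graph = defaultdict(lambda: defaultdict(set))
    for person, company, provider in edges:
        p, c = f"person:{person}", f"company:{company}"
        graph[p][c].add(provider)
        graph[c][p].add(provider)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes from start to goal, or None."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(graph[node]):  # sorted for repeatable results
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Fictional ownership links drawn from three hypothetical providers.
edges = [
    ("Jane Doe", "Acme Ltd", "provider_a"),
    ("John Roe", "Acme Ltd", "provider_b"),
    ("John Roe", "Beta Inc", "provider_b"),
    ("Ann Poe", "Beta Inc", "provider_c"),
]

graph = build_graph(edges)
path = shortest_path(graph, "person:Jane Doe", "person:Ann Poe")
sources = graph["person:Jane Doe"]["company:Acme Ltd"]
```

In this toy example the chain from Jane Doe to Ann Poe runs through two companies and draws on records from three different providers, the kind of link that is invisible when each provider's files are examined in isolation.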