PER MOLDRUP-DALUM

The Royal Danish Library has made the OCR text of a large number of newspapers published between 1666 and 1877 publicly available. This newspaper collection can be found in the Royal Library Open Access Repository: LOAR.

The collection can also be accessed through an API. The system underlying LOAR is DSpace, and the API is described at DSpace REST API. In this post, I’ll explore this API using R.

You can play with this code yourself in a publicly available RStudio Cloud project. The static document is available as an RPub at Playing with the LOAR API.

Extending R

Start by loading the Tidyverse, a library for working with JSON in R, and a library for dates and times.

library(tidyverse)
library(jsonlite)
library(lubridate)
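Every request in this post hits the same base URL, so a tiny convenience wrapper (my own helper, not part of the DSpace API) could reduce repetition. I’ll keep the explicit str_c calls below so each step is self-contained, but you could write something like:

```r
library(jsonlite)
library(stringr)

# Hypothetical convenience helper (my own, not part of DSpace or LOAR):
# fetch a path under the LOAR REST base URL and parse the JSON response.
loar <- function(path) {
  fromJSON(str_c("https://loar.kb.dk/rest/", path))
}

# Usage: loar("communities"), loar("collections"), ...
```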

Finding the newspaper collection

The top level of the LOAR hierarchy is something called communities. We don’t need that concept here, but it gives an idea of where the collections stem from.

fromJSON("https://loar.kb.dk/rest/communities") %>% select(name)
##                                    name
## 1                            AU Library
## 2                     Aarhus University
## 3                       Aquatic Biology
## 4                                  Arts
## 5                     Audio Collections
## 6            Danish School of Education
## 7           Data Management in Practice
## 8             Department of Agroecology
## 9              Department of Bioscience
## 10                               Events
## 11                       IT Development
## 12                              LARM.fm
## 13                                 LOAR
## 14                            Moesgaard
## 15           National Museum of Denmark
## 16                               NetLab
## 17 Newspapers from Royal Danish Library
## 18             Open Digital Collections
## 19                        Research Data
## 20                 Royal Danish Library
## 21  School of Communication and Culture
## 22        School of Culture and Society
## 23                Science and Tecnology
## 24 VDM Video Life Cycle Data Management

But, as we don’t need that hierarchy level, just list the available collections.

fromJSON("https://loar.kb.dk/rest/collections") %>% select(name)
##                                                                              name
## 1                                                                           AnaEE
## 2  Archive for Danish Literature in 3D: Data, Danish Literature & Distant Reading
## 3                                                              Arctic freshwaters
## 4                                  Beretningsarkiv for Arkæologiske Undersøgelser
## 5                                                                 Danmarks Kirker
## 6                                                                 Datasprint 2019
## 7                                                                    Example Data
## 8                                                       Front pages of Berlingske
## 9                                                            LOAR Legal Documents
## 10                                                               Machine learning
## 11                                         Military land use in Denmark 1870-2017
## 12                                                                    NetLab data
## 13                                                           Newspapers 1666-1678
## 14                                                           Newspapers 1749-1799
## 15                                                           Newspapers 1800-1849
## 16                                                           Newspapers 1850-1877
## 17                                                   Open Access LARM.fm datasets
## 18                                                               Ruben Recordings
## 19                                                                        Sandbox
## 20                                                                      Skolelove
## 21                                                    Soviet and Warsaw-pact maps

So, the newspapers are split into four collections. To get those collections, we need their ids. We’ll store this list of collections and ids for later.

fromJSON("https://loar.kb.dk/rest/collections") %>%
  filter(str_detect(name, "Newspaper")) %>% 
  select(name, uuid) -> newspaper_collections
newspaper_collections

What can we then get from a collection? Let’s look at the first one, using this URL

str_c(
  "https://loar.kb.dk/rest/collections/",
  first(newspaper_collections %>% pull(uuid))
)
## [1] "https://loar.kb.dk/rest/collections/8a36005e-07c2-4e88-ace2-b02eddef07b9"
fromJSON(str_c(
  "https://loar.kb.dk/rest/collections/",
  first(newspaper_collections %>% pull(uuid))
))
## $uuid
## [1] "8a36005e-07c2-4e88-ace2-b02eddef07b9"
## 
## $name
## [1] "Newspapers 1666-1678"
## 
## $handle
## [1] "1902/158"
## 
## $type
## [1] "collection"
## 
## $expand
## [1] "parentCommunityList" "parentCommunity"     "items"              
## [4] "license"             "logo"                "all"                
## 
## $logo
## NULL
## 
## $parentCommunity
## NULL
## 
## $parentCommunityList
## list()
## 
## $items
## list()
## 
## $license
## NULL
## 
## $copyrightText
## [1] ""
## 
## $introductoryText
## [1] "Collection of OCR text in csv files from digitised newspapers. The csv files contain\r\n<ul>\r\n<li>Reference to the scanned newspaper page in <a href=\"http://www2.statsbiblioteket.dk/mediestream/avis\" target=\"_blank\">Newspaper article</a>. This reference will point to the article when there in the search field is inserted recordID: and then the reference surrounded by the sign \".</li>\r\n<li>The date the newspaper was printed</li>\r\n<li>The newspaper id</li>\r\n<li>The scanned newspaper page</li>\r\n<li>Text which was generated by doing OCR of the scanned article</li>\r\n</ul>"
## 
## $shortDescription
## [1] "Collection of OCR text in csv files from digitised newspapers"
## 
## $sidebarText
## [1] ""
## 
## $numberItems
## [1] 13
## 
## $link
## [1] "/rest/collections/8a36005e-07c2-4e88-ace2-b02eddef07b9"

Which items do we have in that collection?

str_c(
  "https://loar.kb.dk/rest/collections/",
  first(newspaper_collections %>% pull(uuid)),
  "/items"
)
## [1] "https://loar.kb.dk/rest/collections/8a36005e-07c2-4e88-ace2-b02eddef07b9/items"
fromJSON(str_c(
  "https://loar.kb.dk/rest/collections/",
  first(newspaper_collections %>% pull(uuid)),
  "/items"
))

Let’s pick the first item for a closer look

fromJSON(str_c(
  "https://loar.kb.dk/rest/collections/",
  first(newspaper_collections %>% pull(uuid)),
  "/items"
)) %>% 
  pull(uuid) %>%
  first() -> uuid
uuid
## [1] "b4fb558a-1c56-42de-8c56-7fff565bb7b4"
fromJSON(str_c("https://loar.kb.dk/rest/items/",uuid))
## $uuid
## [1] "b4fb558a-1c56-42de-8c56-7fff565bb7b4"
## 
## $name
## [1] "Newspapers from 1678"
## 
## $handle
## [1] "1902/179"
## 
## $type
## [1] "item"
## 
## $expand
## [1] "metadata"             "parentCollection"     "parentCollectionList"
## [4] "parentCommunityList"  "bitstreams"           "all"                 
## 
## $lastModified
## [1] "2018-02-05 10:24:08.214"
## 
## $parentCollection
## NULL
## 
## $parentCollectionList
## NULL
## 
## $parentCommunityList
## NULL
## 
## $bitstreams
## NULL
## 
## $archived
## [1] "true"
## 
## $withdrawn
## [1] "false"
## 
## $link
## [1] "/rest/items/b4fb558a-1c56-42de-8c56-7fff565bb7b4"
## 
## $metadata
## NULL

So, this is weird: even though the bitstreams value is NULL, I know the item contains the actual content of the record. Let’s look at that

fromJSON(str_c("https://loar.kb.dk/rest/items/",uuid,"/bitstreams"))

And now we’re close to the actual data. In the above table, the data is available in the bitstream with id d2d3869f-ad37-461c-bcb4-79ffc7d9d0fe, and we get it by using the retrieve function from the API. The content is delivered as CSV, and normally I would use read_csv for such data. But this CSV format has some issues with the encoding of quotes, so we must use the more general read_delim function with the two escape_ parameters.

fromJSON(str_c("https://loar.kb.dk/rest/items/",uuid,"/bitstreams")) %>%
  filter(name == "artikler_1678.csv") %>%
  pull(retrieveLink) -> artikler_1678_link

artikler_1678_link
## [1] "/rest/bitstreams/d2d3869f-ad37-461c-bcb4-79ffc7d9d0fe/retrieve"
artikler_1678 <- read_delim(
  str_c("https://loar.kb.dk/",artikler_1678_link),
  delim = ",",
  escape_backslash = TRUE,
  escape_double = FALSE)
## Parsed with column specification:
## cols(
##   recordID = col_character(),
##   sort_year_asc = col_character(),
##   editionId = col_character(),
##   newspaper_page = col_double(),
##   fulltext_org = col_character()
## )
artikler_1678
glimpse(artikler_1678)
## Rows: 80
## Columns: 5
## $ recordID       <chr> "doms_newspaperCollection:uuid:082e12e4-cbc1-4502-93d1…
## $ sort_year_asc  <chr> "1678-01", "1678-09", "1678-03-01", "1678-03-01", "167…
## $ editionId      <chr> "dendanskemercurius1666 1678-01 001", "dendanskemercur…
## $ newspaper_page <dbl> 2, 2, 3, 2, 2, 4, 3, 3, 3, 4, 1, 4, 1, 4, 3, 3, 1, 1, …
## $ fulltext_org   <chr> "(Pi", "Aff Kaaber stobte Hals dersom den cene broler …

What can we do with the data?

To get an idea of the amount of data, let’s count pages

artikler_1678 %>% 
  group_by(sort_year_asc) %>% 
  summarise(page_count = sum(newspaper_page)) %>% 
  arrange(desc(sort_year_asc))

Look at the metadata

Unfortunately, to explore the metadata, we have to download all the actual data. Hopefully this will change in the future. Still, the bitstreams do have a sizeBytes value, so let’s collect those and see how much bandwidth and storage is needed for the full collection. Well, actually, for all four newspaper collections.

So:

for each newspaper collection
  for each item
    sum the sizeBytes of the bitstreams with names matching ^artikler_

First, look at the first collection

fromJSON(str_c("https://loar.kb.dk/rest/collections/",newspaper_collections %>% pull(uuid) %>% first() , "/items"))

Using that technique, we can map the fromJSON call used above over a list of ids to get the items from all the newspaper collections

map_df(
  newspaper_collections %>% pull(uuid),
  ~fromJSON(str_c("https://loar.kb.dk/rest/collections/", .x, "/items"))
) -> all_items
all_items

Now, get all the bitstreams associated with those items. First we can extract the item id

all_items %>% pull(uuid) %>% first()
## [1] "b4fb558a-1c56-42de-8c56-7fff565bb7b4"

and from that item id, get a bitstream

fromJSON(
  str_c(
    "https://loar.kb.dk/rest/items/",
    all_items %>% pull(uuid) %>% first(), "/bitstreams"
  )
)

The bitstream we are interested in is the one named artikler_1678.csv, and we can see that it is the only one in CSV format. Filter for that, and retain just the name and sizeBytes

fromJSON(
  str_c(
    "https://loar.kb.dk/rest/items/",
    all_items %>% pull(uuid) %>% first(), "/bitstreams"
  )
) %>% 
  filter(format == "CSV") %>% 
  select(name, sizeBytes)

So, how do we do that for all items? Well, it should be as easy as getting the items above

map_df(
  all_items %>% filter(row_number() < 4) %>% pull(uuid),
  ~fromJSON(str_c("https://loar.kb.dk/rest/items/", .x, "/bitstreams"))
)

But that gives a weird error message?!

Well, we have this item id: b4fb558a-1c56-42de-8c56-7fff565bb7b4. Which bitstreams does that give us?

fromJSON(str_c("https://loar.kb.dk/rest/items/", "b4fb558a-1c56-42de-8c56-7fff565bb7b4", "/bitstreams"))

Okay, can we use map_df for just that one item?

c("b4fb558a-1c56-42de-8c56-7fff565bb7b4") %>% 
  map_df(
    ~fromJSON(str_c("https://loar.kb.dk/rest/items/", .x, "/bitstreams"))
  )

No?! Okay, what if I select only the needed columns? The assumption here is that one of the unneeded columns is causing the havoc.

c("b4fb558a-1c56-42de-8c56-7fff565bb7b4") %>% 
  map_df(
    ~(fromJSON(str_c("https://loar.kb.dk/rest/items/", .x, "/bitstreams")) %>% select(name,sizeBytes))
)
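Why did select() make a difference? My guess (an assumption, not documented behaviour) is that some columns in the bitstream response are list-columns whose contents vary between items, and bind_rows(), which map_df() uses to combine the results, refuses to combine columns of incompatible types. A toy illustration with made-up tibbles:

```r
library(dplyr)
library(purrr)

# Two fake "responses" whose `extra` column has incompatible types:
# a list-column in one and a plain character column in the other.
resp1 <- tibble(name = "a.csv", sizeBytes = 100, extra = list(1:2))
resp2 <- tibble(name = "b.csv", sizeBytes = 200, extra = "x")

# map_df(list(resp1, resp2), identity)  # fails: can't combine `extra`

# Selecting only the well-behaved columns before binding succeeds
map_df(list(resp1, resp2), ~ select(.x, name, sizeBytes))
```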

Oh, that worked! Then try it with a few more items

all_items %>% filter(row_number() < 4) %>% pull(uuid) %>% 
  map_df(
    ~(fromJSON(str_c("https://loar.kb.dk/rest/items/", .x, "/bitstreams")) %>% select(name,sizeBytes))
)

YES! Now build the data frame that I want. First just for 3 rows (row_number() < 4)

all_items %>% filter(row_number() < 4) %>% pull(uuid) %>% 
  map_df(
    ~(fromJSON(str_c("https://loar.kb.dk/rest/items/", .x, "/bitstreams")) %>% select(name,sizeBytes, format))
) %>% 
  filter(format == "CSV")

And now with everything

all_items %>% pull(uuid) %>% 
  map_df(
    ~(fromJSON(str_c("https://loar.kb.dk/rest/items/", .x, "/bitstreams")) %>% select(name,sizeBytes, format, uuid))
) %>% 
  filter(format == "CSV") %>% 
  select(-format) -> all_bitstreams

Let’s have a look

all_bitstreams
summary(all_bitstreams)
##      name             sizeBytes             uuid          
##  Length:142         Min.   :    23490   Length:142        
##  Class :character   1st Qu.: 17376786   Class :character  
##  Mode  :character   Median : 46032351   Mode  :character  
##                     Mean   : 66204283                     
##                     3rd Qu.: 86858365                     
##                     Max.   :365887109

So, at last, we can get the answer to the question about the resources needed for downloading the complete collection. But first, let’s load a library for formatting numbers in a more human-readable way

library(gdata)

Sum all the bytes

all_bitstreams %>% 
  summarise(total_bytes = humanReadable(sum(as.numeric(sizeBytes)), standard = "SI"))
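As an aside, if you would rather avoid the extra gdata dependency, base R’s format() method for object_size objects can produce similar human-readable output (the standard argument requires R >= 3.6; the byte count below is a made-up example value, not the actual collection size):

```r
# Format an arbitrary byte count with base R (example value only)
format(structure(9.4e9, class = "object_size"), units = "auto", standard = "SI")
```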

That’s more or less the same as installing TeX Live ;-)

Last example: Look at some text

Let’s select a year: 1853, and get all the available text from that year. Now, we can cheat a bit: as we already have all the bitstreams, we can filter their names for 1853

all_bitstreams %>% 
  filter(str_detect(name, "1853.csv"))

Let’s get that bitstream using the GET /bitstreams/{bitstream id}/retrieve endpoint

print(now())
## [1] "2020-05-27 13:23:00 CEST"
articles_1853 <- read_delim(
  "https://loar.kb.dk/rest/bitstreams/f6543ed8-d4ba-40fe-99a8-ba26a5390924/retrieve",
  delim = ",",
  escape_backslash = TRUE,
  escape_double = FALSE)
## Parsed with column specification:
## cols(
##   recordID = col_character(),
##   sort_year_asc = col_date(format = ""),
##   editionId = col_character(),
##   newspaper_page = col_double(),
##   fulltext_org = col_character()
## )
## Warning: 34 parsing failures.
##  row           col   expected actual                                                                               file
## 5889 sort_year_asc date like    1853 'https://loar.kb.dk/rest/bitstreams/f6543ed8-d4ba-40fe-99a8-ba26a5390924/retrieve'
## 5890 sort_year_asc date like    1853 'https://loar.kb.dk/rest/bitstreams/f6543ed8-d4ba-40fe-99a8-ba26a5390924/retrieve'
## 7560 sort_year_asc date like    1853 'https://loar.kb.dk/rest/bitstreams/f6543ed8-d4ba-40fe-99a8-ba26a5390924/retrieve'
## 7561 sort_year_asc date like    1853 'https://loar.kb.dk/rest/bitstreams/f6543ed8-d4ba-40fe-99a8-ba26a5390924/retrieve'
## 7562 sort_year_asc date like    1853 'https://loar.kb.dk/rest/bitstreams/f6543ed8-d4ba-40fe-99a8-ba26a5390924/retrieve'
## .... ............. .......... ...... ..................................................................................
## See problems(...) for more details.
print(now())
## [1] "2020-05-27 13:23:29 CEST"

That took less than a minute, so…

Okay, what did we get?

articles_1853
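As a sketch of a possible next step (assuming articles_1853 as loaded above), we could summarise how much OCR text each edition contains:

```r
# Sketch: amount of OCR text per edition
# (assumes `articles_1853` from the read_delim call above)
articles_1853 %>%
  mutate(n_chars = str_length(fulltext_org)) %>%
  group_by(editionId) %>%
  summarise(pages = n(), total_chars = sum(n_chars, na.rm = TRUE)) %>%
  arrange(desc(total_chars))
```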