Dealing with Images in Obsidian — identifying duplicates, checking the extension, quickly resorting them in a gallery

It’s like déjà vu all over again.
Unknown

Detecting duplicates by file content is a pretty neat feature if you have lots and lots of images. After all, things get confusing when you reference the same image under different names, especially in Obsidian, where backlinks can be really useful.

One of the great features of DEVONthink, which I loved back when it still listened to its customers, is its ability to identify duplicates. It goes not by filename but by content. For example, if you have the same image saved under different names, e.g., imageA.jpg and imageB.jpg, it will highlight them if they are identical. If you do not want to support DEVONthink (or similar apps), there is another way: create a hash for each image and use these hashes to identify duplicates.

Essentially, a hash is simply a combination of letters and numbers that is derived from a file. Among other things, hashes are used to check whether a file you want to download has been changed (e.g., an MD5 hash). For our purposes the important thing is: hashes differ if the files differ and are identical if the files are identical. And there is no difference between verifying that a file offered for download was unchanged when you downloaded it and identifying images (or any other files) with the same content. (Update: Note that there is an issue called a «hash collision» (thanks, ChatGPT) in which different files can have the same hash. However, this is highly unlikely to happen unless you have millions of images. Other hash algorithms (e.g., SHA-256 or SHA-3) further reduce the risk. And as the original and the duplicates are shown together on a page, a hash collision would be immediately noticeable.)
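The principle is easy to try out with the digest package (the same one the script below uses). A minimal sketch with throwaway files instead of real images:

```r
library(digest)

# Three small files: A and B have identical content, C differs
writeBin(as.raw(c(0x01, 0x02, 0x03)), "imageA.tmp")
writeBin(as.raw(c(0x01, 0x02, 0x03)), "imageB.tmp")
writeBin(as.raw(c(0x01, 0x02, 0x04)), "imageC.tmp")

hashA <- digest(file = "imageA.tmp", algo = "md5")
hashB <- digest(file = "imageB.tmp", algo = "md5")
hashC <- digest(file = "imageC.tmp", algo = "md5")

identical(hashA, hashB)  # TRUE:  same content, same hash
identical(hashA, hashC)  # FALSE: different content, different hash

file.remove("imageA.tmp", "imageB.tmp", "imageC.tmp")
```

Swapping `algo = "md5"` for `"sha256"` is all it takes to use a collision-resistant algorithm instead.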

Using R (and some support by ChatGPT), the following script worked pretty well to identify about 20k+ duplicates in my input folder (yep, digital squirrel). It’s the obs.checkAndUpdateImages function, and it creates/overwrites the file Duplicated Images in Obsidian.md with all images that occur at least twice (adapt the PATHINFORMATION). Note that it adds both the duplicated versions and the first version of the image. So you cannot simply delete all listed images; you have to change the references to the duplicated images so that they point to the first image. And yeah, I could automate that, but I’d rather do it myself for the moment.
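The reason the first version shows up as well comes down to how base R’s duplicated() works, which the script relies on: it flags every occurrence except the first, so the script then keeps all files whose hash occurs more than once. A toy example with made-up hashes:

```r
# duplicated() flags the 2nd, 3rd, ... occurrence, never the first
hashes <- c("aaa", "bbb", "aaa", "ccc", "aaa")
dups <- duplicated(hashes)
dups                      # FALSE FALSE TRUE FALSE TRUE

# keep all files whose hash occurs more than once, first copy included
hashes %in% hashes[dups]  # TRUE FALSE TRUE FALSE TRUE
```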

It also deals with two other issues.

I had a few images with the wrong extension (e.g., a JPG image with a .png extension). The check_image_signature function (called by obs.checkAndUpdateImages) checks whether the content and the extension match (for images only).
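The idea behind that check is that each image format starts with fixed "magic bytes" (e.g., FF D8 FF for JPEG, 89 50 4E 47 for PNG), independent of the file name. A tiny standalone sketch with a hypothetical mislabeled file:

```r
# Write a PNG signature into a file that is misnamed .jpg
png_magic <- as.raw(c(0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A))
writeBin(png_magic, "mislabeled.jpg")

# Read the first bytes in binary mode, just like check_image_signature does
con <- file("mislabeled.jpg", "rb")
sig <- readBin(con, "raw", 3)
close(con)

# The JPEG magic bytes are FF D8 FF, so the extension check fails:
all(sig == as.raw(c(0xFF, 0xD8, 0xFF)))  # FALSE: content is PNG, not JPEG

file.remove("mislabeled.jpg")
```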

As I collect a lot of images on an Obsidian page, I use Gallery headers.

```
I: [[#Landscape]] - [[#Long Landscape|Long]] | [[#Portrait]] - [[#Long Portrait|Long]] | [[#Square]]

# Images
## Landscape
## Long Landscape
## Portrait
## Long Portrait
## Square
```

The images are sorted by aspect ratio, making it easier to scroll through them. Using |330 I put two next to each other, followed by an empty line. As this is tedious to do manually, the obs.sortGallery function takes the measures (width and height, and the calculated aspect ratio) and orders the images into the different sections (sorted alphabetically within each section). As R might have problems with umlauts and other special characters (it then drops those images), I copy-paste the whole Obsidian page into the Image Gallery Resort Temp.md page. That page is read and resorted. Just be careful to switch to a different page while R does the sorting. Otherwise Obsidian and R might fight over the file, with Obsidian winning.
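After sorting, a gallery section might look like this (hypothetical file names): two images per line, each scaled to 330 pixels wide, with an empty line between pairs, and any odd image on its own line at the end:

```
## Landscape
![[imageA.jpg|330]] ![[imageB.jpg|330]]

![[imageC.jpg|330]]
```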

BTW, the R script uses <<- to assign the image information to a variable in the global environment. This way, the file gets read only once: if allMediaData still exists in the global R environment, the next gallery sort does not reload the image information. Just remember that you have to update the image information when you add new files to Obsidian or move them within the folder structure. Given that different images might have the same name when they are in different directories, the script uses the path plus filename to identify the images. Thus, if you move images into another folder, you have to update the image information.
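The caching pattern in isolation looks like this (a toy sketch with a stand-in data frame instead of the real read_csv call):

```r
# Stand-in for the read_csv() caching in obs.sortGallery (toy data, not the real file)
load_media_data <- function() {
    if (!exists("allMediaData", envir = globalenv())) {
        message("reading image information from disk ...")
        allMediaData <<- data.frame(namesItself = c("a.png", "b.jpg"))
    } else {
        message("using cached allMediaData")
    }
    invisible(allMediaData)
}

load_media_data()  # first call: reads "from disk"
load_media_data()  # second call: uses the cached copy

# After adding or moving images, force a reload next time:
rm(allMediaData, envir = globalenv())
```

The rm() call at the end is the manual "update the image information" step mentioned above.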

As usual, no warranty.

```r
library(tidyverse)
library(magick)
library(jpeg)
library(png)
library(digest)

compute_hash <- function(file_path) {
    digest(file = file_path, algo = "md5")
}

check_image_signature <- function(image_path) {
    ext <- tolower(tools::file_ext(image_path))
    
    # Open the file in binary mode and read the first few bytes
    con <- file(image_path, "rb")
    on.exit(close(con))
    signature <- readBin(con, "raw", 8)
    
    # Define signatures for common formats
    signatures <- list(
        jpg = as.raw(c(0xFF, 0xD8, 0xFF)),               # JPEG
        png = as.raw(c(0x89, 0x50, 0x4E, 0x47)),         # PNG
        gif = as.raw(c(0x47, 0x49, 0x46, 0x38))          # GIF
    )
    
    # Match signature to the file extension
    is_match <- switch(ext,
                       jpg = all(signature[1:3] == signatures$jpg),
                       jpeg = all(signature[1:3] == signatures$jpg),
                       png = all(signature[1:4] == signatures$png),
                       gif = all(signature[1:4] == signatures$gif),
                       FALSE
    )
    
    return(is_match)
}

obs.checkAndUpdateImages <- function() {
    allMediaData <- read_csv("~/PATHINFORMATION/allImagesData.csv")
    write_csv(allMediaData, "~/PATHINFORMATION/allImagesData_temp.csv")
    readImageData <- tibble(filenameAndPath = list.files( "~/PATHINFORMATION", pattern = "\\.(png|jpg|gif)$", recursive = TRUE, full.names = TRUE)) %>%
        mutate(namesItself = str_trim(basename(filenameAndPath)))
    tempData <- readImageData %>% left_join(allMediaData, join_by(filenameAndPath == filenameAndPath, namesItself == namesItself))
    alreadyDoneData <- tempData %>% filter(done)
    tempData <- tempData %>% filter(!done | is.na(done))
    
    for (i in 1:nrow(tempData)) {
        filenameAndPath <- tempData[[i, "filenameAndPath"]]
        imageMeasures <- getImageMeasures(filenameAndPath)
        if (is.null(imageMeasures)) next
        tempData[[i, "done"]] <- TRUE
        tempData[[i, "width"]] <- imageMeasures$width
        tempData[[i, "height"]] <- imageMeasures$height
        tempData[[i, "aspect_ratio"]] <- imageMeasures$aspect_ratio
        tempData[[i, "category"]] <- imageMeasures$category
        tempData[[i, "correctExt"]] <- check_image_signature(filenameAndPath)
        tempData[[i, "hashSig"]] <- compute_hash(filenameAndPath)

        if (i %% 10 == 0) cat(".")
        if (i %% 500 == 0) {
            cat("\n", i, "done\n")
        }
    }
    
    allMediaData <- bind_rows(alreadyDoneData, tempData)
    allMediaData$dups <- duplicated(allMediaData$hashSig)
    listOfAllDups <- unique(allMediaData$hashSig[allMediaData$dups])
    duplicatedImages <- allMediaData %>% filter(hashSig %in% listOfAllDups) %>% arrange(hashSig, desc(filenameAndPath))
    if (nrow(duplicatedImages) > 0) { warning("Duplicated Images, see Duplicated Images in Obsidian File.") }
    write_lines(paste0("![[", duplicatedImages$namesItself, "|330]]\n\n"), "~/PATHINFORMATION/Duplicated Images in Obsidian.md")
    
    write_csv(allMediaData, "~/PATHINFORMATION/allImagesData.csv")
}

format_images_with_spacing <- function(images) {
    images <- paste0("![[", images, "|330]]")
    image_pairs <- split(images, ceiling(seq_along(images) / 2))
    sapply(image_pairs, paste, collapse = " ") %>% paste(collapse = "\n\n")
}


obs.sortGallery <- function() {
    if(!exists("allMediaData")) {
        allMediaData <<- read_csv("~/PATHINFORMATION/allImagesData.csv")
        print("allImagesData.csv loaded. Make sure the Image Gallery Resort Temp File is NOT OPEN!")
    } else { 
        print("Using loaded allMediaData. Make sure the Image Gallery Resort Temp File is NOT OPEN!")
        }
    md_file_path <- "~/PATHINFORMATION/Image Gallery Resort Temp.md"
    content <- readLines(md_file_path)
    
    start_images <- which(grepl("^# Images", content))
    end_square <- which(grepl("^## Square", content))
    if (length(start_images) == 0 || length(end_square) == 0) stop("Headers not found")
    
    # Determine extraction range
    next_main_header <- which(grepl("^# ", content) & seq_along(content) > end_square)
    desired_content <- content[start_images:(ifelse(length(next_main_header) > 0, next_main_header[1], length(content)) - 1)]
    preText <- content[1:start_images]
    endText <- if (length(next_main_header) > 0) content[next_main_header:length(content)] else "# eof"
    
    image_links <- str_extract_all(desired_content, "!\\[\\[([\\w\\säöüÄÖÜßàèéçâêîôûëïüÿñçãõÁÉÍÓÚáéíóúâêîôûëïüÿñçãõ._-]+\\.(png|jpg|gif))(\\|\\d+)?\\]\\]") %>% unlist()
    
    link_images <- tibble(filenames = image_links) %>%
        mutate(namesItself = str_trim(str_extract(filenames, "(?<=!\\[\\[)[^|\\]]+\\.(png|jpg|gif)")))
    
    # Join and filter existing images in the directory
    page_images <- allMediaData %>%
        inner_join(link_images, by = "namesItself") %>%
        distinct(namesItself, .keep_all = TRUE)
    
    if(nrow(page_images) != nrow(link_images)) {
        # Find images in link_images that are not in all_images
        dropped_images <- anti_join(link_images, page_images, by = "namesItself")
        
        # Display the dropped images
        print(dropped_images$namesItself)
        
        stop("Images dropped.")
    }
    
    page_images <- page_images %>%
        mutate(aspect_ratio2 = factor(category, levels = c("Landscape", "Long Landscape", "Portrait", "Long Portrait", "Square"))) %>%
        arrange(aspect_ratio2, namesItself)
    
    
    # Generate formatted output by category
    formatted_output <- page_images %>%
        group_by(aspect_ratio2) %>%
        summarize(images = format_images_with_spacing(namesItself), .groups = "drop") %>%
        mutate(content = str_c("\n## ", aspect_ratio2, "\n", images, "\n")) %>%
        pull(content)
    
    # Assemble final output and write to file
    output <- c(preText, formatted_output, endText)
    writeLines(output, "~/PATHINFORMATION/Image Gallery Resort Temp.md")
}

getImageMeasures <- function(image_path) {
    # Initialize dims as NULL to detect if all reading attempts fail
    dims <- NULL
    tryCatch({
        ext <- tolower(tools::file_ext(image_path))
        
        # Attempt to read based on file extension, with nested tryCatch for each format
        if (ext == "jpg" || ext == "jpeg") {
            dims <- tryCatch({
                dim(readJPEG(image_path, native = TRUE))
            }, error = function(e) NULL)
        }
        if (is.null(dims) && ext == "png") {
            dims <- tryCatch({
                dim(readPNG(image_path, native = TRUE))
            }, error = function(e) NULL)
        }
        if (is.null(dims) && ext == "gif") {
            dims <- tryCatch({
                img <- magick::image_read(image_path)
                c(image_info(img)$width[1], image_info(img)$height[1])
            }, error = function(e) NULL)
        }
        # Fallback for unknown formats or format mismatches
        if (is.null(dims)) {
            dims <- tryCatch({
                img <- magick::image_read(image_path)
                c(image_info(img)$width[1], image_info(img)$height[1])
            }, error = function(e) NULL)
        }
    }, error = function(e) {
        message("Error reading image dimensions for: ", image_path, " - ", e$message)
    })
    
    # If dimensions could not be obtained, return NULL
    if (is.null(dims) || length(dims) != 2) return(NULL)
    
    # Calculate aspect ratio
    width <- dims[2]
    height <- dims[1]
    aspect_ratio <- width / height
    
    # Categorize based on aspect ratio
    category <- dplyr::case_when(
        aspect_ratio <= 1.1 & aspect_ratio >= 0.9 ~ "Square",
        aspect_ratio > 1.1 & aspect_ratio <= 1.6 ~ "Landscape",
        aspect_ratio > 1.6 ~ "Long Landscape",
        aspect_ratio < 0.9 & aspect_ratio >= 0.6 ~ "Portrait",
        aspect_ratio < 0.6 ~ "Long Portrait",
        TRUE ~ "Unknown"
    )
    
    return(list(width = width, height = height, aspect_ratio = aspect_ratio, category = category))
}
```

Note that it reads in an allImagesData.csv file. To avoid having to calculate the hashes again and again, the information is saved in that .csv file. If it does not exist, just create an empty one (not sure, but it should work). You also need to adapt the PATHINFORMATION to your file and folder structure.
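If you want to be on the safe side when bootstrapping, you can create a header-only .csv with the columns the script expects instead of a completely empty file. This is a sketch; the column set is inferred from the script above, and read_csv may still need explicit col_types, since a file without rows gives it nothing to guess the types from:

```r
library(tibble)
library(readr)

# Zero-row tibble with the columns used by obs.checkAndUpdateImages
emptyData <- tibble(
    filenameAndPath = character(),
    namesItself     = character(),
    done            = logical(),
    width           = numeric(),
    height          = numeric(),
    aspect_ratio    = numeric(),
    category        = character(),
    correctExt      = logical(),
    hashSig         = character(),
    dups            = logical()
)

# Adapt the path to your PATHINFORMATION:
write_csv(emptyData, "allImagesData.csv")
```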

Interesting what is possible with R (and yeah, I could use the same principle to identify duplicated articles in my Library … hmmm). 🙂
(And kudos to ChatGPT: I did not know a few of the image functions, and it was really good at providing them.)