PDF files detected as plain/text #12

Open
ofthelit opened this issue Dec 14, 2015 · 1 comment
Comments

@ofthelit

This example PDF file is detected as text/plain when only the first MaxHeaderSize bytes are used for detection: http://www.orimi.com/pdf-test.pdf

I would run the file signature detection before checking for plain text files.

public static FileType GetFileType(Func<byte[]> fileHeaderReadFunc, string fileFullName = "")
{
    // if none of the types match, return null
    FileType fileType = null;

    // read first n-bytes from the file
    byte[] fileHeader = fileHeaderReadFunc();

    // compare the file header to the stored file headers
    foreach (FileType type in types)
    {
        int matchingCount = GetFileMatchingCount(fileHeader, type);
        if (matchingCount == type.Header.Length)
        {
            // check for docx and xlsx only if a file name is given;
            // there may be situations where the file name is not given
            // or it is impractical to write a temp file to get the FileInfo
            if (type.Equals(ZIP) && !String.IsNullOrEmpty(fileFullName))
                fileType = CheckForDocxAndXlsx(type, fileFullName);
            else
                fileType = type;    // if all the bytes match, return the type

            break;
        }
    }

    if (fileType == null)
    {
        // no signature matched; maybe it is just plain text?
        // treat the file as text if the header contains no zero byte
        // (not exact, but should do the job); this will not recognize
        // UTF-16 or UTF-32 text, which contains zero bytes
        if (!fileHeader.Any(b => b == 0))
        {
            fileType = TXT;
        }

        // this would be the place to add detection based on file extension e.g. .csv

    }

    return fileType;
}
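
To illustrate why the order matters, here is a minimal, self-contained sketch. It assumes the misdetection comes from the zero-byte text heuristic running before the signature comparison. `OrderingSketch`, `TextFirst`, `SignatureFirst`, and the sample header are hypothetical names made up for this illustration and are not part of the library. The only fixed fact used is that PDF files start with the ASCII bytes `%PDF`.

```csharp
using System;
using System.Linq;
using System.Text;

static class OrderingSketch
{
    // "%PDF": the magic number at the start of a PDF file.
    static readonly byte[] PdfSignature = { 0x25, 0x50, 0x44, 0x46 };

    // Same idea as the TXT fallback above: no zero byte means "looks like text".
    public static bool LooksLikeText(byte[] header) => !header.Any(b => b == 0);

    // True if the header begins with the given signature bytes.
    public static bool StartsWith(byte[] header, byte[] signature) =>
        header.Length >= signature.Length &&
        header.Take(signature.Length).SequenceEqual(signature);

    // Hypothetical "text check first" ordering (assumed cause of the bug).
    public static string TextFirst(byte[] header) =>
        LooksLikeText(header) ? "text/plain"
        : StartsWith(header, PdfSignature) ? "application/pdf"
        : null;

    // Ordering proposed in this issue: signatures first, text as a fallback.
    public static string SignatureFirst(byte[] header) =>
        StartsWith(header, PdfSignature) ? "application/pdf"
        : LooksLikeText(header) ? "text/plain"
        : null;

    static void Main()
    {
        // Made-up header: the start of a PDF is often pure ASCII,
        // so a short read may contain no zero byte at all.
        byte[] header = Encoding.ASCII.GetBytes("%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\n");

        Console.WriteLine(TextFirst(header));      // text/plain
        Console.WriteLine(SignatureFirst(header)); // application/pdf
    }
}
```

With signatures checked first, the text heuristic only ever sees headers that matched no known type, so an ASCII-only PDF header can no longer be claimed as plain text.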
@RouR

RouR commented Mar 2, 2016

Thanks, my tests are green after this patch.
