How to count three consecutive records with same values in a column based on the value of another column in PostgreSQL?

I have a table attendance as follows:

create table attendance (id int, name varchar(30), status varchar(10), date date)  

This table has the following records:

insert into attendance values (1,'John','absent','2016-03-01');
insert into attendance values (1,'John','absent','2016-03-02');
insert into attendance values (1,'John','absent','2016-03-03');
insert into attendance values (2,'Sam','present','2016-03-04');
insert into attendance values (3,'Sam','absent','2016-03-05');
insert into attendance values (1,'John','absent','2016-03-06');
insert into attendance values (1,'John','absent','2016-03-07');
insert into attendance values (1,'John','absent','2016-03-08');
insert into attendance values (1,'John','present','2016-03-09');
insert into attendance values (1,'John','absent','2016-03-10');
insert into attendance values (1,'John','absent','2016-03-11');
insert into attendance values (1,'John','present','2016-03-12'); 
insert into attendance values (1,'John','absent','2016-03-13');   

Now I want to count that how many times a person has three consecutive absent records.
The result should be as follows:

id name count
 1  John   2  

because John is absent for three consecutive days two times.

If they have 6 absents on 6 consecutive dates, then it should count 2 i.e (3+3).

It should count 1 for 3 consecutive absents on 3 consecutive dates i.e if John is absent for 1,2 and 3 march, it should count 1. but if John is absent for 1,2 and 4 march then it should not count 1.
4 or 5 absents should be counted as 1.

If there is no entry for a date, It is considered 'Present'.

Any help is appreciated.

Answers 3

  • Another solution. Recursive part creates groups of rows of status "absent". The solution uses recursive CTEs and window functions.

    WITH RECURSIVE a(id, name, date, n)  as (
        SELECT id, name, q.date, 1 as n
        FROM (
            SELECT id, name, date,
                date-lag(date) OVER (PARTITION BY name ORDER BY date) as lag
            FROM  attendance
            WHERE status='absent'
        ) q
        WHERE lag >1 or lag is null
        UNION
        SELECT a.id, a.name, a.date, a.n + 1
        FROM a
        JOIN attendance at ON (
               a.id = at.id and at.name = a.name and at.date = a.date + n)
        WHERE at.status='absent'
    )
    SELECT id, name, sum(long_absences) FROM (
        SELECT id, name, count(*)/3 as long_absences
        FROM a
        GROUP BY id, name, date having count(*) >=3
        ) as absences
    GROUP BY id, name;
    

  • I beleive @user103153 solution is better so I'll just add this as an example of another technique. I did not have PostgreSQL to try with so some adjustments likely have to be made. I also use a calendar table, in psql either a recursive CTE or a generating expression can be used:

    We start by constructing the domain of names and dates:

    select distinct d, name 
    from calendar c 
    cross join attendance a
    

    from there we can construct the status:

    select nc.name, nc.d, a.date, a.name as aname
             , coalesce(a.status, 'present') as status        
    from ( 
        select distinct d, name 
        from calendar c 
        cross join attendance a
    ) as nc
    left join attendance a                                                       
        on a.name = nc.name
        and a.date = nc.d
    

    Then by using a common trick we can construct groups of consecutive name, date and status:

    select name, d, status, row_number() over (partition by name order by d) 
                          - row_number() over (partition by name, status order by d) as grp 
    from (
        select nc.name, nc.d, a.date, a.name as aname
             , coalesce(a.status, 'present') as status        
        from ( 
            select distinct d, name 
            from calendar c cross join attendance a
        ) as nc
        left join attendance a                                                       
            on a.name = nc.name
            and a.date = nc.d
    ) 
    

    Next step is to determine the start and stop for each interval where status is 'absent':

    select name, min(d), max(d) from (
        select name, d, status, row_number() over (partition by name order by d) 
                              - row_number() over (partition by name, status order by d) as grp 
        from (
            select nc.name, nc.d, a.date, a.name as aname
                 , coalesce(a.status, 'present') as status        
            from ( 
                select distinct d, name 
                from calendar c 
                cross join attendance a
            ) as nc
            left join attendance a                                                       
                on a.name = nc.name
                and a.date = nc.d
        ) 
    ) 
    where status = 'absent' 
    group by grp, name order by min(d);
    
    NAME                           2          3         
    ------------------------------ ---------- ----------
    John                           03/01/2016 03/03/2016
    Sam                            03/05/2016 03/05/2016
    John                           03/06/2016 03/08/2016
    John                           03/10/2016 03/11/2016
    John                           03/13/2016 03/13/2016
    

    Finally (left as an exercise :-) we take the length of the interval and integer divide that with 3 and filter out the ones with 0. Then it's just a matter of summing the number per name.


  • I think I cobbled together something that does what you're looking for. I used several subqueries to check for consecutive absences, and also to avoid double dipping on a long absence (meaning 6 days in a row should count ONLY the 2 combined marks from the first day, and not the individual marks on absences 2-4, 3-5, and 4-6).

    Would be curious to see if there's a more elegant solution from someone else.

        select 
                z.name, 
                sum(z.absent_marks) 
        from (
                select 
                        sub1.* 
                from (
                        select 
                                x.name, 
                                x.adate, 
                                floor(max(x.timeshift)/3)::integer as absent_marks
                        from (
                                 select 
                                        a.name,
                                        a.date as adate,
                                        b.date as bdate,
                                        dense_rank() over (partition by a.name, a.date order by b.date) as timeshift
                                from attendance a
                                left join attendance b 
                                on a."name" = b."name" 
                                and a.date <= b.date
                                and a.status = b.status 
                                where a.status = 'absent'
                        ) x 
                        where x.timeshift - 1 = datediff('day', x.adate, x.bdate)
                        group by x.name, x.adate
                ) sub1
                left join (                
                        select 
                                x.name, 
                                x.adate, 
                                floor(max(x.timeshift)/3)::integer as absent_marks
                        from (
                                 select 
                                        a.name,
                                        a.date as adate,
                                        b.date as bdate,
                                        dense_rank() over (partition by a.name, a.date order by b.date) as timeshift
                                from attendance a
                                left join attendance b 
                                on a."name" = b."name" 
                                and a.date <= b.date
                                and a.status = b.status 
                                where a.status = 'absent'
                        ) x 
                        where x.timeshift - 1 = datediff('day', x.adate, x.bdate)
                        group by x.name, x.adate) sub2
                on sub1.adate = sub2.adate + 1
                and sub1.name = sub2.name 
                where (sub2.absent_marks = 0 or sub2.absent_marks is null)
                and sub1.absent_marks != 0
                ) z
        group by
                z.name
    

Related Questions